Matt Rickard

Why Do Configuration Changes Cause Outages?

blog.matt-rickard.com

Nov 22, 2022

At a glance, a good percentage of outages are caused by bad configuration changes – the 2021 global Facebook outage, the $440 million bad configuration that brought down Knight Capital in 2012, and numerous global outages at Google Cloud, Microsoft Azure, Cloudflare, and other companies with serious engineering cultures. Why do configuration changes cause so many outages?

  • Configuration breaks production/development parity – ideally, services in development and production deploy similar code and run on similar but separate infrastructure. No matter how close you get to production, some configuration will always need to be different (e.g., deploying to development.matt-rickard.com vs. matt-rickard.com).

  • Some configuration isn't testable – the Facebook outage was caused by bad BGP configuration. Internet configuration is often the hardest to test: how do you emulate BGP routes, DNS, or other global infrastructure in a way that mimics the real world? Much like models, all staging environments are wrong, but some are useful.

  • Every sufficiently advanced configuration language is wrong. Significant problems can occur in the configuration itself – most configuration languages are untyped, templated, manually duplicated, or synced by hand.
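
To make the parity and typing problems concrete, here is a minimal sketch of typed configuration in Python. It assumes a single production definition from which each environment derives only the values that must differ. The `ServiceConfig` fields, the `for_environment` helper, and the specific limits are all hypothetical, chosen for illustration rather than taken from any real system.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical service settings; the field names are illustrative."""

    hostname: str
    replicas: int
    request_timeout_s: float

    def __post_init__(self):
        # Reject wrong types at load time instead of discovering them in production.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, got {type(value).__name__}"
                )
        # Reject values that are well-typed but nonsensical.
        if self.replicas < 1:
            raise ValueError("replicas must be at least 1")
        if self.request_timeout_s <= 0:
            raise ValueError("request_timeout_s must be positive")


def for_environment(base: ServiceConfig, **overrides) -> ServiceConfig:
    """Derive an environment's config from a shared base.

    Only the overridden fields differ; everything else stays identical,
    and the result is validated again by __post_init__.
    """
    return replace(base, **overrides)


PRODUCTION = ServiceConfig(
    hostname="matt-rickard.com", replicas=3, request_timeout_s=2.0
)
DEVELOPMENT = for_environment(
    PRODUCTION, hostname="development.matt-rickard.com", replicas=1
)
```

Deriving `DEVELOPMENT` from `PRODUCTION` keeps the unavoidable differences explicit and small, and a typo such as `replicas="3"` or a misspelled field name fails immediately rather than at deploy time. This does nothing for configuration that cannot be tested locally, such as BGP routes.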

What helps prevent outages due to bad configuration?

  • The SRE book by Google touches on some practical properties of safe configuration changes – (1) gradual configuration changes, (2) rollback-safe changes, and (3) automatic rollback or canary deploys.

  • Infrastructure-as-code likely helps. Configuration that is code is easier to test, validate, and deploy. This usually means that configuration is owned by the right teams and goes through the right reviews (e.g., code reviews).

  • Limited blast radius for configuration – control planes that are cleanly separated from data planes and configuration with clearly associated services.

  • Immutable infrastructure – configuration that goes through multiple changes is likely to end up in a state that is hard to emulate from scratch in test environments.  
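
The first three properties (gradual changes, rollback-safe changes, and automatic rollback) can be sketched together. The following Python is a simplified illustration, not how any particular deployment system works: `gradual_rollout`, the stage fractions, and the in-memory `fleet` are assumptions, and it presumes the change is rollback-safe, meaning that re-applying the old configuration fully restores a host.

```python
from typing import Callable, Dict, List, Sequence

Config = Dict[str, float]


def gradual_rollout(
    hosts: Sequence[str],
    old: Config,
    new: Config,
    apply: Callable[[str, Config], None],
    healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.1, 0.5, 1.0),
) -> bool:
    """Apply `new` to a growing fraction of hosts, rolling back on failure.

    Returns True if every stage passed its health check, False if the
    change was rolled back.
    """
    updated: List[str] = []
    for fraction in stages:
        # Always touch at least one host so the first stage acts as a canary.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(updated):target]:
            apply(host, new)
            updated.append(host)
        if not all(healthy(h) for h in updated):
            # Automatic rollback: restore the old config on every touched host.
            for host in updated:
                apply(host, old)
            return False
    return True


# Hypothetical in-memory fleet standing in for real infrastructure.
OLD = {"timeout_s": 2.0}
fleet: Dict[str, Config] = {f"host-{i}": dict(OLD) for i in range(10)}
touched: List[str] = []


def apply_config(host: str, config: Config) -> None:
    touched.append(host)
    fleet[host] = dict(config)


def is_healthy(host: str) -> bool:
    # Stand-in health check: a zero timeout breaks the service.
    return fleet[host]["timeout_s"] > 0


bad_change_ok = gradual_rollout(
    list(fleet), OLD, {"timeout_s": 0.0}, apply_config, is_healthy
)
```

In this simulation the bad change reaches only the single canary host before the health check fails and the old configuration is restored, which is the limited blast radius the list describes. A real system would also need a health signal that actually reflects the failure, which is often the hard part.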

But configuration change outages are anything but a solved problem.
