Reflections on 10,000 Hours of DevOps

Mar 31, 2023

Some reflections after putting 10,000 hours into DevOps engineering.

From my early adolescence doing sysadmin work and customizing my Arch Linux installation, to running a server in the closet of my college dorm (narrator: it was loud, and my email rarely delivered), to working on open-source DevOps at Google, I’ve probably put in many more hours. It’s hard to tell how many of those counted as “deliberate practice” (Anders Ericsson’s term, popularized by Malcolm Gladwell’s 10,000-hour rule), but these are the lessons learned nonetheless. (Also see my more general reflections on 10,000 hours of programming.)

Reproducibility matters. Without it, subtle bugs burn hours of debugging time and kill productivity.
Never reuse a flag.
The value of a CI/CD Pipeline is inversely proportional to how long the pipeline takes to run.
Code is better than YAML.
Linear git history makes rollbacks easier.
Version your APIs. Even the internal ones. No stupid breaking changes (e.g., renaming a field). Don’t reinvent the wheel. Use semantic versioning.
Do not prematurely split a monorepo. Monorepos have U-shaped utility (great for extremely small or large orgs).
Vertical scaling (bigger machines) is much simpler than horizontal scaling (sharding, distributed systems). But sometimes, the complexity of distributed systems is warranted.
Your integration tests are too long.
Have a high bar for introducing new dependencies. Especially ones that require special builds or environments.
Release early, release often.
Do not tolerate flaky tests. Fix them (or delete them).
Make environments easy to set up from scratch. This helps in every stage: local, staging, and production.
Beware toolchain sprawl. Every new tool requires expertise, management, and maintenance.
Feature flags and gradual rollouts save headaches.
Internal platforms (e.g., a PaaS) can make developers more productive, but make sure you aren’t getting in the way. Only create new abstractions that could only exist in your company.
Don’t use Kubernetes, yet. Make sure your technology’s complexity matches your organization’s expertise.
Cattle, not pets (prefer ephemeral infrastructure over golden images). Less relevant in the cloud era but important to remember.
Avoid shiny objects but know when the paradigm shifts.
Technical debt isn’t universally bad.
Meaningful health checks for every service. Standardize the endpoint (e.g., /healthz) and statuses.
80/20 rule for declarative configuration. The last 20% usually isn’t worth it.
Default to closed (minimal permissions for) infrastructure.
Default to open for humans. It’s usually a net benefit for developers to view code outside their own project.
Bash scripts aren’t as terrible as their reputation. Just don’t do anything too complex. Always “set -ex” and “set -o pipefail” (or, in one line, “set -exo pipefail”).
Throttle, debounce, and rate-limit external APIs.
Immutable infrastructure removes a whole class of bugs.
Makefiles are unreasonably effective.
If you have to do a simple task more than 3 times, automate it.
Be practical about vendor lock-in. Don’t over-engineer a generic solution when it’s incredibly costly. But proprietary solutions have a cost too (developer experience, customizability, etc.).
Structured logging (JSON) in production, plaintext in development.
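
To make that last lesson concrete, here is a minimal sketch of switching log formats by environment using Python's standard logging module. The APP_ENV variable, the "app" logger name, and the choice of JSON fields are all assumptions made for illustration; a real service would likely add timestamps, request IDs, and so on.

```python
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def make_formatter(env: str) -> logging.Formatter:
    """JSON in production, human-readable plaintext everywhere else."""
    if env == "production":
        return JsonFormatter()
    return logging.Formatter("%(levelname)s %(name)s: %(message)s")


def configure_logging(env: str) -> logging.Logger:
    """Attach a single stream handler with the environment's formatter."""
    handler = logging.StreamHandler()
    handler.setFormatter(make_formatter(env))
    logger = logging.getLogger("app")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    # APP_ENV is a hypothetical variable name; default to development.
    log = configure_logging(os.environ.get("APP_ENV", "development"))
    log.info("service started")
```

Keeping the format decision in one small function means the rest of the code logs the same way everywhere, and only the deployment environment decides whether a machine or a human is the intended reader.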

Matt Rickard
