Complexity and the 4 a.m. test
With most technology, there’s almost always More Than One Way To Do It (unless you worship Python). Choices must be made, and different people use different yardsticks to decide. Some try to minimize “cost,” either up-front development cost or long-term engineering cost. The smarter ones have recognized the concept of “Technology Debt,” as addressed by several observers. As a leader in Operations, however, I tend to subscribe to my own rule: the 4 a.m. rule.
Simply put, the 4 a.m. rule is this:
Never adopt any solution which you couldn’t understand immediately upon being awoken to fix it at 4 a.m.
There’s a very simple reason to adhere to this rule whenever possible: as I’ve previously mentioned, things fall apart. Systems all break, complex ones and simple ones alike. Sooner or later, people need to fix them, and the more byzantine the operation of the system, the harder that will be.
The simplest way to survive the 4 a.m. test is to build only very simple systems. A totally simple system is sometimes just the ticket to solve the problem, and where it is adequate, go with it. Interesting problems occasionally have extremely elegant solutions, and making them more complex is just bad mojo.
Still, you’ll much more often find a place where more complexity is necessary to achieve your desired goal. In these circumstances, it can be tricky to pass the 4 a.m. test. This is where two strategies are necessary: documentation and transparency.
Documentation deserves a whole separate discussion, but the part that’s important at 4 a.m. is a complete lack of subtlety:
- Recovery instructions: you’ll have bleary eyes, so these must be as simple as “if this, do that”
- Architecture diagrams: simple pictures with bright colors and clearly labeled lines detailing what talks to what and why. And don’t make me load Visio at 4 a.m. ever.
- If it’s needed, and can fail, it should be mentioned, but in as few words as necessary. This is not the time for flowery prose.
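To make the “if this, do that” idea concrete, here is a minimal sketch of recovery instructions kept as a plain lookup table. Everything in it is an assumption made up for illustration: the `RUNBOOK` name, the symptoms, the actions, and the `recovery_step` helper do not come from any real system.

```python
# Hypothetical runbook: every symptom and action below is invented for illustration.
RUNBOOK = {
    "disk full on db host": "Delete rotated logs older than 7 days, then restart the database.",
    "queue depth above 10000": "Restart the consumer service; if it keeps rising, page the queue owner.",
}

# Fallback shown when no entry matches, so the reader is never left guessing.
FALLBACK = "No runbook entry: escalate to the on-call lead."


def recovery_step(symptom: str) -> str:
    """Return the one action for a symptom, or a clear fallback instruction."""
    key = symptom.strip().lower()  # tolerate sloppy 4 a.m. typing
    return RUNBOOK.get(key, FALLBACK)


if __name__ == "__main__":
    print(recovery_step("Disk full on DB host"))
```

The point of the flat table is that there is nothing to interpret: one symptom maps to one action, and an unknown symptom maps to an explicit escalation rather than silence.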
Transparency is quite a bit harder. This is about exposing as much as possible to someone observing the system. A few places are crucial:
- Error messages: For the love of god, people, make sure every message requires absolutely no prior knowledge and is clear and unambiguous even out of context.
- Simple dependencies: Nothing is harder to untangle than an extremely complex web of services. If you ever see a design with a recursive dependency, run like heck.
- Change logging: the first question you should ask when something is broken is “what changed.” Keep a record of even the boring stuff - you never know when it’ll save your bacon.
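Recursive dependencies are easier to run from if you can spot them before 4 a.m. The sketch below is one possible way to check for them, assuming you can write your services down as a simple map of “service depends on these services.” The `find_cycle` function and the service names are hypothetical; the technique is an ordinary depth-first search.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of service names, or None if there is none."""
    IN_PROGRESS, DONE = 1, 2
    state: Dict[str, int] = {}   # services not in this dict are unvisited
    path: List[str] = []         # chain of services currently being explored

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep) == IN_PROGRESS:
                # dep is already on the current chain: that is a cycle
                return path[path.index(dep):] + [dep]
            if dep not in state:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for service in deps:
        if service not in state:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Hypothetical services, invented for illustration.
    services = {"web": ["auth"], "auth": ["db"], "db": ["web"]}
    print(find_cycle(services))
```

Returning the actual chain of names, rather than a bare yes or no, follows the same rule as the error messages above: whoever reads the result should not need any prior knowledge to see what is tangled.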
Remember as a cardinal rule:
complexity is a vice: use it sparingly and explain it simply enough for 4 a.m.
Tags: 4am test, complexity, outages
September 15th, 2008 at 5:47 am
I noticed you didn’t pick the magic “3 AM” designation, overused in the recent past :).