Failure Modes

I write software, and I know I’m not perfect: nobody is. I make mistakes, and I overlook certain cases when thinking through how to implement a feature, so inevitably there will be bugs in the code I write. I try my best not to introduce them, but they creep in nevertheless. On the good days, these bugs will be obvious and easy to fix; on the bad days, they’ll be subtle and insidious.

And sometimes, the bugs are a side-effect of the sheer complexity of the software we write, or of the interactions between the disparate systems that we build.

I’ve been reminded of this quite a bit recently. The software my team owns is responsible for a key part of how we manage network devices globally; it needs to interact with many other components to get its job done, and we tend to work with several teams, especially when things go awry.

That’s not to say that our software is somehow broken and buggy. Rather, we sometimes find ourselves in failure modes that are novel and surprising, and oftentimes a failure is not the result of any single software system in the chain.

We engineer our software to at least be resilient in the face of failure, since failure is inevitable in any distributed system. But sometimes the failure modes are far removed from the level you work at: for example, a hardware issue in some device somewhere manifests as your software system failing in a way that is weird at first blush but, upon closer inspection and after tracing the links of the chain, makes complete sense.

It reminds me of Connections, a documentary series by James Burke from the late 70s (and revived in the 90s): sometimes, some event or invention in the past turns out to have surprising importance to the present, and only by tracing the links in that chain of events does the contribution of that one thing become apparent. In the same way, some logical flaw in some dependent system somewhere, triggered by a particularly unusual set of circumstances, and coupled with the nature of the beast and the assumptions we’ve built into it, ends up causing problems whose root cause is not evident until you tease it out.

In all honesty, the systems I’d built and worked with before never really showed such interesting behavior. They were fairly self-contained, and any such failure modes were limited to rarities that would almost certainly never come to pass in the lifetime of the software. But when you’re talking about the massive scale that AWS is at, it’s almost inevitable that the rarities become less rare than expected. It’s refreshing and a bit exciting, weirdly enough, and I find it shifts my thinking toward how things will fail, and therefore toward building defenses against those failures.
