On Call

I’m currently doing my second rotation as the on-call engineer supporting our software, and although this isn’t the first time I’ve done software support, it’s a different beast altogether to support software in an organization such as AWS.

When a support ticket gets into your queue, and that ticket is marked as being of high severity, you pay attention to it. Any bug or issue has a huge potential blast radius, so it’s definitely a good idea to act decisively, without hitting the panic button along the way.

That’s not to say that every ticket marked as high severity is indeed important: after all, behind those tickets there is often a panicked human user desperately trying to hit a deadline. You might be the smartest engineer in the company, but in the face of a crisis you wouldn’t necessarily be thinking clearly, and often you too would hit the panic button – you’d fire off a high severity ticket to get yourself unblocked, since you have no idea how else to solve whatever’s stopping you from doing your job.

Our users are all internal to AWS, so we don’t deal with external customers – all our customers are also AWS engineers, and the tickets in our queue are usually detailed enough that it’s easy for the on-call to assess what’s happening, or where to look.

That being said, we’re all human: again, when you have a job to do, and the tool you’re using isn’t working the way you think it should, you file a ticket (if you don’t submit a patch to fix it yourself).

The one thing that I appreciate about doing operations, though, is getting to be in close contact with the sharp edges of the software we write – the tickets in the operations queue often point to the things that make the software we’ve built hard to use, and for every panicked engineer who’s filed a ticket incorrectly marked as high severity, there’s a bit of user-hostile behaviour that needs fixing.

Beyond that, the whole act of supporting software gives you insight into how the software works – which sounds counterintuitive, given the presumption that you need to understand how it works in order to support it in the first place. It’s rare that you’ve written every single bit that goes into the software you maintain; unless you’ve written the whole stack from the firmware up, the software you have written will interact with things you haven’t.

Even the software that your team writes isn’t all written by the current members of the team. After all, a large chunk of the time we spend as programmers and software developers is maintenance, and often it’s code we didn’t write in the first place.

If it’s not yet evident, I’m quite an advocate for the whole idea of devops, as it exposes engineers to the hidden assumptions at play when we write software. Often, we can’t dogfood our own software, and in that case supporting our software provides the feedback that dogfooding would normally bring.

The operations queue then acts as an indirect way of observing user behavior.

It’s important that engineers experience operations as early as possible in their careers, though not necessarily unguided: for the sake of getting things done, you would often want a more experienced teammate assisting. But close contact definitely makes a difference, I think, and I’m glad I volunteered to get into the on-call rotation.
