Don’t Fear the Beeper

Derek Power
7 min read · Jan 17, 2022
Photo by Volodymyr Hryshchenko on Unsplash

On-call is, sadly, part and parcel of some roles in IT. Historically it has mainly been the Network Operations Centre (NOC) and Operations teams that participated in an out-of-hours rotation, with Site Reliability Engineering (SRE) taking up the mantle in recent times. But on-call, when done properly, does not have to be something feared and hated. It can, in fact, be a useful improvement and learning tool.

I still remember my most annoying on-call. I was the escalation engineer and got a call at five minutes past eight on Christmas morning from the on-call engineer. They had seen a spike in 2xx response codes (one of the metrics we monitored) and wanted to know how to correct it.

Let’s just say the choice words I had when explaining that our platform ‘performing better’ was not something we generally ‘fixed’ are not suitable for repeating here. But it did spark the idea in my head about how on-call should be done.

It should be something people almost want to volunteer for (extra pay and time-in-lieu not being the only reason). When managed properly it can even be an ideal training tool.

Education

On-call rotations can be ideal educational situations for engineers to truly get to grips with the platform they help create and need to support. Being paged late at night because something has broken is equal parts annoying and scary, but it is also an ideal time to see just how much (or little) you know about the platform. You get a chance to investigate the problem, see if you know how to find the solution and then resolve the issue. A properly staffed on-call rotation will also have an escalation path, so that newer members have a bit of a safety net if they cannot fix the problem and clear the alerts.

The escalation path, typically a senior-level engineer, is a good litmus test for your on-call on-boarding. How often does the first engineer have to escalate? Are the same issues being escalated every time? If so, the on-boarding for on-call is possibly missing some details. Perhaps the playbook (the steps required to resolve the issue) is unclear and needs to be reviewed and updated.

Ideally you want that escalation path to be an ‘in case of emergency’ only engineer, not somebody who is going to risk suffering ‘on-call fatigue’ by being called by the on-call engineer every time something goes beep in the night.

Then you have the actual on-call engineers themselves. If one engineer resolves an issue quickly but another takes much longer on the same issue, perhaps some internal training and knowledge sharing needs to happen. After all, you want all on-call engineers to be roughly equal in terms of skills; otherwise management starts to slip into the bad mindset of ‘I’d prefer if that engineer isn’t on-call during our busy period.’

If there is any knowledge that is not in the documentation and playbooks, but lives only in the heads of certain engineers, then the next working day should include an ‘update documents’ task for the on-call engineer. Capturing that information so everyone has access to it makes the on-call experience better for everyone.

Improvements

As mentioned above, one of the selling points about participating in an on-call rotation is an educational one. So you can see where you may have gaps in your knowledge of the platform or spot where the training may need to be improved. But there is another huge benefit for a company when their engineers are involved in on-call: improvements.

Nothing sparks the creative juices in an engineer’s mind more than when they get woken at 3 am by an alert that probably did not need to fire. While they fix the issue they can start to think about how to avoid the alert triggering in the future. Auto-scaling, maybe self-healing scripts that are automatically called before the alert needs to wake a person. You would be surprised at how hard it is to think of these solutions in the cold light of day while doing your normal work.
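A self-heal step like this can be sketched in a few lines. Everything below is illustrative, not the API of any real paging tool: the thresholds, the `cleanup` callable and the `should_page` decision are assumptions about how such a script might be wired up.

```python
import shutil

# Hypothetical thresholds -- tune these for your own platform.
WARN_THRESHOLD = 0.80   # try the self-heal step above this usage fraction
PAGE_THRESHOLD = 0.90   # wake a human above this, even after cleanup


def current_usage(path: str = "/") -> float:
    """Disk-usage fraction (0.0-1.0) for a mount point."""
    total, used, _free = shutil.disk_usage(path)
    return used / total


def should_page(usage_before: float, cleanup) -> bool:
    """Run the self-heal step first; only page if it didn't help enough.

    `usage_before` is the usage fraction that tripped the alert;
    `cleanup` is a callable that frees space (e.g. rotates logs,
    purges tmp files) and returns the new usage fraction.
    """
    if usage_before < WARN_THRESHOLD:
        return False          # noise -- nothing to do, nobody woken
    usage_after = cleanup()   # attempt the automatic fix first
    return usage_after >= PAGE_THRESHOLD
```

In practice something like this would run as a scheduled job, or be triggered by the monitoring system as a first responder before the page ever reaches a human.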

Yet when your sleep is on the line an engineer can come up with some elegant solutions. Solutions that will improve the stability of the platform, result in better user experience and, most importantly, reduce the amount of alerts that get a person out of bed.

This isn’t to say that every alert which fires will result in work that reduces on-call toil; sometimes an alert really is firing because of a genuine problem. But without being on-call it is hard to tell which is which.

Expansion

This part is important if you want an on-call rotation that people are not fearful of: it needs to expand beyond just the NOC, SRE and Operations teams. The old management mindset that these teams should be able to handle every out-of-hours alert is not only outdated, it was never correct to begin with.

One of the main metrics for measuring whether your on-call rotation is serving your users well is MTTR (Mean Time To Resolution). To keep that window as small as possible you want the alerts waking the right people. Why would you expect your SRE team to be woken for a problem that is application related? Likewise, why would your developers get out of bed to fix a space issue on a node? Back in the old days management did not think like this. SRE/Operations/NOC teams were expected to be masters of all.

Effectively expecting your on-call engineer to be an entire department when it came to the skills they needed.
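For the curious, MTTR is just the average time from an incident opening to its resolution. A minimal sketch, assuming your incidents can be exported as (opened, resolved) timestamp pairs:

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean Time To Resolution over (opened, resolved) incident pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)
```

Tracking this number per team, rather than for the rotation as a whole, is one way to spot whether alerts are actually reaching the people best placed to resolve them.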

But if you invest time in your on-call rotations, ensuring that people are on-boarded correctly and that the alerts which wake folk are ones that genuinely need a human involved, you can expand the on-call burden beyond just your operations sphere. You can bring in developers, so that when application-specific alerts fire the right person for the job is woken.
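Routing an alert to the right rotation can be as simple as a label lookup. A hedged sketch, with illustrative rotation names and an assumed `layer` label rather than any particular paging tool’s configuration:

```python
# Hypothetical label-based routing -- the rotation names and the
# `layer` label are illustrative, not from a real paging tool.
ROUTES = {
    "application": "dev-on-call",
    "infrastructure": "sre-on-call",
    "network": "noc-on-call",
}


def route_alert(labels: dict) -> str:
    """Page the rotation that owns the failing layer directly.

    SRE acts as the fallback for unlabelled alerts, not as a
    swivel chair relaying every page to somebody else.
    """
    return ROUTES.get(labels.get("layer"), "sre-on-call")
```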

This approach, however, cannot be a half measure. One company I worked in had developers participate in an on-call rotation, but they were purely an escalation path for the SRE team. The expectation was that the on-call SRE engineer would act as a swivel chair: woken up by the paging tool only to then call the developer who was on-call. Nobody in the rotation should be a swivel chair if you want to keep that MTTR value as low as possible. Have the right alerts wake the right people and, along with education, your on-call rotations will be well-oiled machines.

Leadership Adjustment

A phrase I regularly use about monitoring is that it needs to be organic, not something that is done once and forgotten about. It should be work that is revisited regularly to ensure it does not go stale. This same approach applies to the on-call rotation.

One of the worst on-call rotations I inherited came from a manager who had decreed that ‘everything should be alerting’, with no real rhyme or reason behind what woke somebody up. This resulted in nearly two hundred alerts firing a night, with at least a hundred of these waking the on-call engineer. On-call fatigue was not just present in the team; it was causing people to genuinely consider leaving the company. And it had been this way for a number of years.

Working with the NOC manager, we started having weekly reviews of the alerts that woke people up during the week and, along with the on-call engineer for that week, we trimmed the fat: alerts that made no sense but had still gotten somebody out of bed. After the first month we had reduced the number of pages to an average of ten a week.
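A weekly review like this needs a ranked list of offenders to work through. A minimal sketch, assuming you can export the week’s pages as a list of alert names:

```python
from collections import Counter


def noisiest_alerts(pages: list[str], top: int = 5) -> list[tuple[str, int]]:
    """Rank the alerts that woke people the most this week.

    The top entries are the first candidates for the weekly
    'trim the fat' review: fix them, demote them to daytime
    tickets, or delete them outright.
    """
    return Counter(pages).most_common(top)
```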

If you’re hoping for your engineers to come up with self-heal scripts and auto-scaling solutions, ten pages a week is much more likely to get you there than a hundred.

After our initial ‘pager purge’ we kept the weekly review meeting going, bringing in the on-call engineer each time, and started to review pages based on geography (we had customers all over the world) and on whether the self-heal approaches were given enough time to run before waking people up.

Inside of a year we had the alerts down to single digits a week and people actually volunteering to join the rotation. We had shown that management knew the fatigue was real and was willing to do something to address it.

While also not causing any problems for customers.

No alert that was removed, and no page that had its timing adjusted, resulted in any increase in poor experience for our customers. If anything we managed to provide more up-time for customers, because engineers now had time to think about improvements and were not too tired to work on those same ideas.

The point is, your on-call rotation should not be a thing that engineers fear being involved with. As a manager you can use it as an educational tool that benefits both customer and company.


Derek Power

Head of Cloud Infra by day, gamer by night, author of a comedy-fantasy series called ‘Filthy Henry’ by twilight — Trust me, I always lie.