Using an alert manager to respond to critical IT production incidents
If you’re an IT operations manager or a DevOps manager, one of your duties is to make sure you have a well-defined process for responding to production incidents. If not, I can guarantee you will not sleep well!
Non-critical incidents tend to be issues that can be investigated within a couple of days. Critical incidents, however, require someone to act immediately to mitigate the negative impact on your business. If you’re operating a SaaS, typical critical incidents are: website unavailable, all logins failing, near-crash situations, data leaks, system intrusions, denial of service, highly suspicious activity…
To respond to critical incidents properly, you must set up an alert manager (AM). If you don’t, incidents will eventually cost you far more than the effort it takes to respond fast.
Here are several features you’ll need from an AM:
- Automatic trigger
If you already have automatic production incident detection, you will need to integrate it with the AM so that it triggers alerts automatically.
- Triage
You need to assess alerts properly. Generate too many critical alerts and, in the long term, they will no longer be taken seriously. Employees will eventually experience on-call fatigue. An AM must remove noise before anything else.
- Find the right person to reach out to
You can’t just contact a random member of the team; you probably have an on-call rotation schedule to follow. An AM stores the schedule calendar and the phone numbers for you.
- Wake you up
Forget text messages or emails. An AM must call a phone number, and the device must use a loud ringtone.
- Call rotation
If a call is missed, you might decide to have a rotation strategy in your AM that falls back to a second or third phone number.
- Status follow-up
An AM must make sure someone has acknowledged and then closed the alert.
- Alarm drills
If you’re good at managing your infrastructure, you should not be receiving alerts very often. You must therefore set up alarm drills in your AM to make sure everything still works.
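To make a few of these features concrete (triage, finding the right person, call rotation, and acknowledgement), here is a minimal Python sketch. It is not how any particular AM product works: `Shift`, `Alert`, `on_call_numbers`, `escalate`, and the `place_call` callback are all hypothetical names invented for illustration, and `place_call` stands in for whatever telephony integration you would actually use.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Optional


@dataclass
class Shift:
    """One on-call shift (hypothetical structure)."""
    start: datetime
    end: datetime
    phone_numbers: list[str]  # ordered: primary first, then fallbacks


@dataclass
class Alert:
    title: str
    critical: bool
    acknowledged_by: Optional[str] = None
    attempts: list[str] = field(default_factory=list)


def on_call_numbers(schedule: list[Shift], now: datetime) -> list[str]:
    """Return the phone numbers of the shift covering `now`, or an empty list."""
    for shift in schedule:
        if shift.start <= now < shift.end:
            return shift.phone_numbers
    return []


def escalate(
    alert: Alert,
    schedule: list[Shift],
    now: datetime,
    place_call: Callable[[str, str], bool],
) -> bool:
    """Call each on-call number in order until someone acknowledges.

    `place_call(number, message)` is an assumed callback that returns True
    when the person answered and acknowledged the alert.
    """
    if not alert.critical:
        return False  # triage: non-critical alerts never page anyone
    for number in on_call_numbers(schedule, now):
        alert.attempts.append(number)  # keep a trail for status follow-up
        if place_call(number, alert.title):
            alert.acknowledged_by = number
            return True
    return False  # nobody acknowledged: the alert is still open
```

An alarm drill in this sketch would simply be a call to `escalate` with a test alert and the real schedule, checking that it returns `True` and that `acknowledged_by` is set.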
There are many ways you can customize an AM; here’s one very typical response you can make:
You will find plenty of AM products on the market. The most popular ones are PagerDuty, OpsGenie, and xMatters. I’ll be writing posts comparing AM products, so make sure you follow me!