The service tier for a system is defined by the owning delivery team and the product managers responsible for that system. Operations Support provides 24×7 support for all systems regardless of service tier. However, we should only call out for systems that are designated Platinum or Gold (a grey area, I know: there will always be the odd exception).
Priority denotes the level of impact an incident is having. We generally use it when reporting on our incidents in tools like Looker.
P1 incidents are those having a major impact, such as a website being down or people not being able to publish content.
P2 would be something that is badly degraded: for example, some people could not subscribe whereas others could. Take the recent incident of subscriptions not working in the US as a good example of a P2.
Please note we should not categorize incidents as P1 or P2 in Response unless something is down or severely degraded.
Note that if an alert comes in or appears on Heimdall and says it is Severity 1, that does not mean it is a Priority 1 incident. They are different things!
All alerts need to be investigated on their own merits and, ideally, there should be business impact before we create an incident in Response. This is all part of troubleshooting and triaging.
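As a rough illustration of the point that priority follows business impact rather than alert severity, here is a minimal Python sketch. The `Impact` enum and the `incident_priority` function are hypothetical names used only for illustration; they are not part of Response, Heimdall, or any other tool mentioned here.

```python
from enum import Enum
from typing import Optional


class Impact(Enum):
    # Hypothetical impact levels, used only for this illustration.
    MINOR = "minor"
    SEVERELY_DEGRADED = "severely degraded"
    DOWN = "down"


def incident_priority(impact: Impact) -> Optional[str]:
    """Map observed business impact to a priority, or None if neither P1 nor P2 applies."""
    if impact is Impact.DOWN:
        return "P1"
    if impact is Impact.SEVERELY_DEGRADED:
        return "P2"
    # Anything less should not be categorized as P1 or P2.
    return None


# Alert severity is deliberately not an input: a Severity 1 alert
# with only minor impact does not become a P1 incident.
print(incident_priority(Impact.DOWN))
print(incident_priority(Impact.MINOR))
```

The function takes no severity argument on purpose, which mirrors the rule that the two scales are assessed separately.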
The Severity levels in Response are:
The severity of an alert represents how serious the alert is and how fast Operations Support should either fix it themselves or get a delivery team to do so.
The alert might be notifying us that something is already down or broken, or that it will be if action is not taken straight away. Most of the time, Severity 1 alerts come from systems like Pingdom and are already telling us something is broken.
All alerts that come into Operations Support need to be looked at and assessed as part of the troubleshooting and triaging process. We are 24×7, so if a Severity 2 alert comes in that we can resolve then we should do so.
So, in summary:
Severity 1 – Fix or escalate immediately (Callout may be necessary – depending on Service Tier). Generally, something is broken or will be very soon if action is not taken.
Severity 2 – Treat as a priority, fix or escalate ASAP. If support from a delivery team is required, get them to deal with the alert as soon as someone is available. Remember to notify them and advise what you have done already. This might be done in Slack or via a Jira ticket.
Severity 3 – Assess alongside other activities, either deal with the alert ourselves or get the delivery team to take a look during normal support hours (not urgent).
Note: You can have a Severity 1 alert with something down and still not call out because the service is Bronze. On the flip side, you could have a Severity 1 alert on a Platinum service with nothing down and still have to call out, if the alert instructions or the runbook say something must be done.
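The summary and note above could be sketched in Python roughly as follows. This is a simplified illustration, not how Response or any alerting tool works: the `Tier` enum, `CALLOUT_TIERS`, `ACTIONS`, and `triage` are all hypothetical names, only the tiers named in this article are listed, and the sketch ignores the "odd exception" as well as anything the alert instructions or runbook might say.

```python
from enum import Enum
from typing import Tuple


class Tier(Enum):
    # Only the tiers named in this article; others may exist.
    PLATINUM = "platinum"
    GOLD = "gold"
    BRONZE = "bronze"


# Assumption: callouts are limited to these tiers, ignoring the odd exception.
CALLOUT_TIERS = {Tier.PLATINUM, Tier.GOLD}

# Summary wording for each severity level, paraphrased from the list above.
ACTIONS = {
    1: "fix or escalate immediately",
    2: "treat as a priority, fix or escalate ASAP",
    3: "assess alongside other activities",
}


def triage(severity: int, tier: Tier) -> Tuple[str, bool]:
    """Return (action, callout_may_be_needed) for an alert."""
    if severity not in ACTIONS:
        raise ValueError(f"unknown severity: {severity}")
    # A callout is only on the table for Severity 1 on a callout-eligible tier;
    # the alert instructions or runbook still have the final say.
    callout_may_be_needed = severity == 1 and tier in CALLOUT_TIERS
    return ACTIONS[severity], callout_may_be_needed


print(triage(1, Tier.BRONZE))    # Severity 1 on Bronze: act now, but no callout
print(triage(1, Tier.PLATINUM))  # Severity 1 on Platinum: callout may be needed
```

Keeping severity and tier as separate inputs reflects that the severity decides how fast we act, while the tier decides whether a callout is justified.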
No alerts should be ignored unless they have been previously triaged by us.
For a more detailed discussion, check out Severity Levels for Technology Incidents.
If anyone needs any guidance or has any questions please let Graham Jackson know.