In the world of a site reliability engineer (SRE), failure is not only an option but expected. Systems, web applications, servers, and devices are all prone to performance issues and unexpected outages at some point. It is an unavoidable fact. These failures can lead to significant revenue loss, loss of customer trust, and, depending on the industry, fines. Fortunately, SRE incident management is one of the core practices used to limit the disruption caused by unexpected issues. In a different article, we talked about chaos engineering and how SRE teams proactively seek out and test for failures to prevent the worst from happening. However, as we are all aware, issues can slip through the cracks. The goal is to prevent these incidents from becoming large-scale cascading failures. SREs and DevOps teams can then use these incidents to build back better and improve their systems and services.
What is an Incident?
Before we dig further into this topic, we must first discuss what an incident is. Where is the line drawn between something that requires immediate action and something that can be investigated later? If every issue were classified as urgent, nothing would get resolved. In the context of information technology (IT), an incident is simply an event or issue that disrupts normal operation or quality of service. It may not yet have resulted in a complete failure, but if left unchecked, it can have a greater impact on your services and operations. And incidents usually happen at 2:00 a.m., while you are blissfully asleep, only to be woken by the sound of your phone going off. We are kidding, of course, but you know something is bad if it happens that early in the morning. Nothing good happens at 2:00 a.m., especially in the IT industry.
What is Incident Management?
Now that we have talked about what an incident is, we can define incident management: the process by which teams resolve these events and bring systems and services back to normal operation. We should also note that incident management is just one element of a larger concept known as IT Service Management, or ITSM. ITSM defines how teams design, create, and deliver their services. It is much more than just IT support. ITSM covers the policies, processes, and structure behind the lifecycle of IT services. One of the most widely used frameworks for ITSM is the Information Technology Infrastructure Library, or ITIL.
ITIL provides the framework and guidelines for building out ITSM solutions. You may already be familiar with other frameworks, like Business Process Framework (eTOM), Control Objectives for Information and Related Technologies (COBIT), FitSM, ISO/IEC 20000, and Microsoft Operations Framework (MOF).
The IT Service Management (ITSM) Framework
If we step back and focus on the elements within the ITSM framework for a bit, six other components make up the ITSM “wheel” along with incident management. We will not go into detail about each of them, but it is important to understand how these pieces fit together with incident management.
The IT service catalog is typically a database or resource that an organization creates to provide users with information on their operational services and offerings. These service catalogs provide useful information about current and planned services, as well as pricing, purchasing process, points of contact, and other deliverables.
The service desk can be thought of as the point of contact between the service provider and users, such as internal employees, stakeholders, or customers. It is the central “hub” where users go to get assistance and service. By ITIL definition, the work of the service desk may take the form of incident resolution or service requests, but whatever the case, its primary goal is to provide quick and efficient service.
When we talk about incident management, an SRE team may be able to quickly resolve an incident, but the underlying problem may still exist and persist for a while longer. Problem management is the process by which the root causes of incidents are permanently fixed, which improves long-term performance and future service deployments.
Any type of change, whether a new service deployment or a personnel change, carries an element of risk. Change management is the process of determining how a change will affect the service deployment and the business itself. Change management is also sometimes grouped with release management.
You cannot virtualize everything…yet. Software services still require physical devices and hardware to function, and organizations need to track, manage, and continually update these devices so that their services run smoothly. This practice is known as asset management, also referred to as IT asset management, or ITAM.
Knowledge, Policy, and Procedure Management
The goal of knowledge management is to reduce redundancy in terms of collecting, reviewing, and sharing information within an organization. This helps to improve efficiency and ensures that information is consistent, up-to-date, and available.
Incident Management Lifecycle: Process and Steps
An organization’s response to an incident, whether it involves downtime, a security breach or cyber-attack, or prolonged latency and repeated errors, is critical to the continued success of the business and the trust of the customer or end user. SREs must manage complex distributed systems. These systems are more reliable, scalable, and fault tolerant, but they are also extremely complex, which can result in longer remediation times because issues are harder to detect and pinpoint. The best SRE incident management teams adhere to a strict incident management and remediation process. While the actual steps may vary between organizations, most follow the same basic path. Let us look at the SRE incident management process and steps.
You cannot fix issues that you do not know about. Incident identification begins with some form of monitoring or alerting mechanism. We talked about monitoring distributed systems, and how it pertains to SRE teams, in a different article. Knowing when and where an error, downtime, or application latency occurs is a critical factor in limiting the impact on users and customers. In some cases, however, an incident will become known through a support ticket, a phone call, or even social media, and it is never good news when issues are posted publicly for all to see.
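As a rough illustration of monitoring-based identification, the sketch below flags a window of health checks as a possible incident when the error rate or average latency crosses a threshold. The `HealthSample` shape, the `detect_incident` function, and the threshold values are all hypothetical, chosen only for this example; real monitoring platforms expose their own alert conditions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HealthSample:
    """One monitoring observation for a service (hypothetical shape)."""
    status_code: int
    latency_ms: float


def detect_incident(samples: List[HealthSample],
                    max_error_rate: float = 0.05,
                    max_avg_latency_ms: float = 800.0) -> Optional[str]:
    """Return a short reason if the window looks unhealthy, otherwise None.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if not samples:
        # Missing data is treated as a signal in its own right.
        return "no data received from monitor"
    errors = sum(1 for s in samples if s.status_code >= 500)
    error_rate = errors / len(samples)
    avg_latency = sum(s.latency_ms for s in samples) / len(samples)
    if error_rate > max_error_rate:
        return f"error rate {error_rate:.0%} exceeds {max_error_rate:.0%}"
    if avg_latency > max_avg_latency_ms:
        return (f"average latency {avg_latency:.0f} ms "
                f"exceeds {max_avg_latency_ms:.0f} ms")
    return None


window = [HealthSample(200, 120.0), HealthSample(503, 95.0),
          HealthSample(200, 110.0), HealthSample(500, 130.0)]
print(detect_incident(window))
```

Checking the error rate before latency is a design choice for this sketch: failed requests are usually the more urgent signal, so they are reported first.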
Whatever the method of detection, once an incident has been identified, it should be logged. Incident logging serves multiple purposes. It creates a formal record of the incident and makes it possible to review incident trends later. If the same or a similar incident appears repeatedly, it might be an indication of a more complex issue that needs to be addressed. When logging an incident, include relevant information such as the timestamp, a description of the incident, and who discovered the issue. The more detailed the information, the better.
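To make the logging step concrete, here is a minimal sketch of an incident record holding the fields mentioned above (timestamp, description, and who discovered the issue), plus a simple check for repeated incidents. The `IncidentRecord` and `IncidentLog` names, the `service` field, and the in-memory list are assumptions made for illustration; a real team would normally log to its ticketing or incident platform.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class IncidentRecord:
    """Formal record of one incident (hypothetical fields)."""
    description: str
    reported_by: str
    service: str
    detected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


class IncidentLog:
    """In-memory log, used here only to show trend review."""

    def __init__(self) -> None:
        self._records: List[IncidentRecord] = []

    def log(self, record: IncidentRecord) -> None:
        self._records.append(record)

    def recurring(self, min_count: int = 2) -> List[str]:
        """Services with at least min_count logged incidents."""
        counts = Counter(r.service for r in self._records)
        return sorted(s for s, n in counts.items() if n >= min_count)


incident_log = IncidentLog()
incident_log.log(IncidentRecord("HTTP 503 on checkout", "monitor", "checkout-api"))
incident_log.log(IncidentRecord("Slow search results", "support ticket", "search"))
incident_log.log(IncidentRecord("Checkout timeouts", "on-call engineer", "checkout-api"))
print(incident_log.recurring())
```

Grouping by service is only one possible way to spot repeats; grouping by error type or affected region would follow the same pattern.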
Next comes the categorization of the incident based on factors like severity, urgency, or the functional area impacted. As with logging, the more information that is provided, the easier it is to determine the right team or individual to assign to the incident response.
Based on how the incident was categorized, the next step is to set the priority level. Some of these steps overlap, so in some cases categorization and prioritization are carried out at the same time. Organizations typically use a simple scale of low, medium, or high. However, some incidents may automatically fall into a specific priority level depending on what is impacted. For example, an incident related to an outage would automatically be treated as high priority.
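The categorization and prioritization steps can be sketched as a small rule function. The low, medium, and high scale and the "outage is always high" rule come from the description above; the `Category` fields and the 10 percent cutoff for medium priority are assumptions invented for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class Category:
    """How an incident was categorized (hypothetical fields)."""
    functional_area: str
    is_outage: bool
    users_affected_pct: float  # 0 to 100


def set_priority(category: Category) -> Priority:
    """Map a categorized incident to a priority level."""
    if category.is_outage:
        # Outages are automatically high priority.
        return Priority.HIGH
    if category.users_affected_pct >= 10:
        # Assumed cutoff; each organization defines its own.
        return Priority.MEDIUM
    return Priority.LOW


print(set_priority(Category("payments", is_outage=True, users_affected_pct=100)))
print(set_priority(Category("reporting", is_outage=False, users_affected_pct=2)))
```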
Incident Response, Resolution, and Closure
The last step is to respond to and resolve the incident and bring it to closure. This step is more of an art form than a science. There is no easy button here. It can take several cycles and attempts to confirm that the incident is finally resolved. Each attempt can bring more information and additional theories as to why the incident may be happening. It can also reveal further areas where weaknesses may be present. Once the incident has been dealt with, it is time to close the request and respond to the user who originally reported the incident.
After an incident response, it is typically a good idea to review the details of the incident in full. This is called an incident postmortem. Which incidents require a postmortem is typically decided by the team or organization, but the reasons for holding one remain the same. Postmortems help identify areas that can be improved, uncover performance blind spots, and refine your incident response process. A postmortem should summarize all aspects of the incident and include the following elements:
- High-level summary and timeline of the incident.
- Root-cause analysis and source of the incident.
- Actions taken to resolve the incident and which ones were effective or not effective.
- Future incident prevention along with additional information that was discovered.
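One way to keep postmortems consistent is to capture the elements listed above in a small template. The sketch below is only an illustration: the `Postmortem` class, its field names, and the plain-text layout are assumptions, not a standard format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Postmortem:
    """Holds the postmortem elements listed above (hypothetical template)."""
    summary: str
    timeline: List[Tuple[str, str]]   # (time, event)
    root_cause: str
    actions: List[Tuple[str, bool]]   # (action, was it effective?)
    prevention: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the postmortem as plain text."""
        lines = ["Summary: " + self.summary, "Timeline:"]
        lines += [f"  {time}  {event}" for time, event in self.timeline]
        lines.append("Root cause: " + self.root_cause)
        lines.append("Actions taken:")
        lines += [f"  [{'effective' if ok else 'not effective'}] {action}"
                  for action, ok in self.actions]
        lines.append("Prevention:")
        lines += [f"  - {item}" for item in self.prevention]
        return "\n".join(lines)


report = Postmortem(
    summary="Checkout unavailable for 18 minutes",
    timeline=[("02:00", "Alert fired"), ("02:18", "Service restored")],
    root_cause="Connection pool exhausted after a configuration change",
    actions=[("Restarted application servers", False),
             ("Rolled back configuration", True)],
    prevention=["Add an alert on connection pool saturation"],
)
print(report.render())
```

Note that the template records what happened and what was tried, not who did it, which fits the blameless approach.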
Postmortems are one of the core practices of SRE culture. In fact, SRE teams refer to the blameless postmortem. The idea behind this concept is that everyone on the team acted with the best intentions and no one is to blame for the incident. The focus is on identifying why it happened and how to improve system performance moving forward. Mistakes are a natural part of the industry, so instead of blaming individuals, the focus is on creating a more robust, resilient system so that the same issues are less likely to happen again.
SRE Incident Management: Tools & Services
Today, SREs have access to a seemingly unlimited range of tools, platforms, and services to help automate and manage their workload. We have already covered some of these tools in a different article, so here we will focus specifically on SRE incident management tools.
Read: Top 13 Site Reliability Engineer (SRE) Tools
Incident, Alerting, & Communication Tools
Incident management, communication, and alerting tools can be some of the most important tools SRE teams utilize. The sooner your team is aware, the quicker the incident can be taken care of. These tools should be utilized along with your monitoring strategy. The Dotcom-Monitor platform integrates with these tools (and more), providing a seamless way to incorporate the tools that your teams may already be using with your monitoring and observability goals.
PagerDuty can help identify and trigger alerts based upon an organization’s specific monitoring requirements. By automating the incident identification stage, teams can reduce the amount of manual oversight and time required to begin the incident management process. The right teams are notified immediately, meaning incident response can happen as soon as possible.
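As a hedged sketch of what automated alert triggering can look like, the function below builds a "trigger" event in the general shape used by PagerDuty's Events API v2 (a routing key, an event action, and a payload with a summary, source, and severity). The function name, the sample values, and the placeholder key are hypothetical, and nothing is sent over the network; consult the current PagerDuty documentation for the exact endpoint and required fields before relying on this.

```python
import json
from typing import Any, Dict

# Severity values as we understand the Events API v2 to accept them.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> Dict[str, Any]:
    """Build (but do not send) an alert event for an incident."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,   # integration key for the service
        "event_action": "trigger",    # opens a new alert
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",   # placeholder, not a real key
    summary="Checkout error rate above threshold",
    source="checkout-api",
)
print(json.dumps(event, indent=2))
```

In practice, the resulting JSON would be posted to the events endpoint by the monitoring integration rather than by hand-written code.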
VictorOps, now Splunk On-Call, is an incident automation platform to help reduce the time it takes to resolve incidents, providing SREs and DevOps teams a way to efficiently manage their incident response process. Splunk On-Call can also assist with simplifying on-call schedules and incident escalation policies.
While Slack is not a true incident response tool, communication is an important factor during the incident response process. One of the more recognizable and popular chat applications on the market, Slack gives SRE teams a way to bring all communications into one place. Great for internal communication, Slack can also automate responses and events and even hook into other systems and services.
If your organization uses Office 365, then you are probably already aware of Microsoft Teams. Like Slack, Microsoft Teams is a real-time communication application that offers features like online messaging, video chat, and document sharing.
Another incident response solution, OpsGenie provides teams with the ability to set up and configure automated alerting through groups and filtering mechanisms. Additionally, SREs can manage on-call routing rules and specific escalation policies. OpsGenie also provides features like reporting and analytics so teams can view and track incident response metrics and efficiencies.
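To illustrate the idea of an escalation policy in general terms, the sketch below decides who should have been notified based on how long an alert has gone unacknowledged. This is not OpsGenie's API or configuration format; the `EscalationLevel` class, the `who_to_page` function, and the timings are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationLevel:
    """One step of an escalation policy (hypothetical shape)."""
    notify: str
    after_minutes: int  # minutes unacknowledged before this level is paged


def who_to_page(policy: List[EscalationLevel],
                minutes_unacknowledged: int) -> List[str]:
    """Return everyone who should have been paged so far, in order."""
    ordered = sorted(policy, key=lambda level: level.after_minutes)
    return [level.notify for level in ordered
            if minutes_unacknowledged >= level.after_minutes]


policy = [
    EscalationLevel("primary on-call", 0),
    EscalationLevel("secondary on-call", 10),
    EscalationLevel("engineering manager", 30),
]
print(who_to_page(policy, 12))
```

Once someone acknowledges the alert, escalation would stop; that state is left out here to keep the example short.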
Conclusion: SRE Incident Management – Overview, Techniques, and Tools
SRE incident management is critical for keeping systems, applications, sites, and services up and running. Seconds matter, especially when it comes to the user’s experience. In large distributed systems, the smallest issue can cause cascading problems. Proactively setting up the right alerts and notifications can make the difference when issues happen and help ensure that the impact on users is limited. For more information about how the Dotcom-Monitor platform integrates with these incident management tools, please visit our Knowledge Base.
Try Dotcom-Monitor free for 30 days and get access to all the solutions, integrations, and features within the platform.