In one of our previous articles, we discussed what an SRE is, what they do, and some of the common responsibilities a typical SRE may have, like supporting operations, dealing with trouble tickets and incident response, and general system monitoring and observability. In this article, we will take a deeper dive into the various principles and guidelines that a site reliability engineer practices in their role. Like DevOps practices, these SRE principles serve as a guide to drive alignment around meeting and supporting the goals of the organization.
Google was the first company to create, embrace, and put support behind the role of site reliability engineering. Since that time, the SRE role has evolved as the industry has shifted from traditional monolithic structures to large, widely distributed networks and microservices. However, one thing has largely remained the same: the principles that SREs adhere to. These core principles are focused on driving system and service reliability. Let us take a deeper dive into each of them.
Embracing and Managing Risk
Embracing risk is the first of the SRE principles, and for good reason. In order to improve a system's reliability, it is important to gauge the impact of "what if" failures. It is understood that no system is 100 percent reliable; at some point, something is going to go wrong. Unfortunately, the everyday user or customer does not know, or care, enough to be so understanding. And there is an inherent cost associated with ensuring reliability, whether that cost is financial, time-related, or measured in a customer's confidence in your services.
An SRE's responsibility is to lean into failure and risk in order to learn how to ultimately make their services and systems more resilient. However, there are tradeoffs that need to be considered. For example, ensuring maximum reliability may come at the cost of being able to rapidly deploy future services. Or perhaps further improvements would not translate into a substantial gain in revenue. The goal is to make a system as reliable as it needs to be, but no more, as the cost and time associated with going beyond that outweigh the potential benefits.
Service Level Objectives
The principle of embracing risk is closely tied to service level objectives, or SLOs. To go a bit deeper, SLOs are the formalized set of objectives within a service level agreement (SLA) that are measured against service level indicators, or SLIs. SLIs are the actual performance metrics of your services. For example, if your SLO states that your uptime must be 99.9%, the measured SLI must meet or exceed that figure in order to meet that specific SLO. SLIs are the indicators that an SRE continually monitors, so that if a metric ever falls outside its threshold, teams are alerted and the issue can be resolved quickly. SLIs should be tied to the user, reflecting what matters most to their experience of the service.
SLAs vs. SLOs vs. SLIs
- SLAs. Agreements made with your clients or customers that define the level of service that is going to be delivered.
- SLOs. Agreements within the SLA that state specific metrics, like uptime, response time, security, issue resolution, etc.
- SLIs. The actual performance, or measurements, of your SLOs that determine the level of compliance.
The SLOs are used to measure actual performance against the SLA, which is the agreement between a service provider and the client. Again, this all goes back to the idea that there needs to be an agreement, or understanding, about how much risk can be tolerated for a given service.
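To make the relationship concrete, here is a minimal sketch (not a Dotcom-Monitor API) of checking an availability SLI against a 99.9% uptime SLO and computing the remaining error budget. The probe counts are made-up illustrative numbers.

```python
SLO_TARGET = 0.999  # 99.9% uptime, as agreed in the SLA

def availability_sli(successful_checks: int, total_checks: int) -> float:
    """SLI: the fraction of monitoring probes that found the service up."""
    return successful_checks / total_checks

def error_budget_remaining(sli: float, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is violated)."""
    allowed_failure = 1.0 - slo   # e.g. 0.1% of checks may fail
    actual_failure = 1.0 - sli
    return (allowed_failure - actual_failure) / allowed_failure

# 99,950 of 100,000 probes succeeded: the SLO is met, with half the budget left.
sli = availability_sli(successful_checks=99_950, total_checks=100_000)
print(f"SLI: {sli:.4%}")                                        # 99.9500%
print(f"SLO met: {sli >= SLO_TARGET}")                          # True
print(f"Error budget left: {error_budget_remaining(sli):.0%}")  # 50%
```

The error budget framing is what lets teams "spend" reliability headroom on faster releases, tying this principle back to embracing risk.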
Read: Learn more about managing SLA compliance within your organization.
Reducing Toil
Toil, as it is defined within the scope of the SRE role, is the amount of manual work required to keep services running. One of the main goals of an SRE is to automate as much work as possible, which frees up time for more important tasks. And when you think about it, reducing toil should really be a part of anyone's job. Less time spent on redundant tasks means better productivity in the long run. Any time a site reliability engineer must engage in repetitive manual activities related to managing a production service, that work can be described as toil.
In many instances, an SRE may have to carry out manual, time-consuming activities, but not all of them should be defined as toil. It is key, however, to identify which activities within the SRE team are consuming the most time, and from there, where improvements can be made to reduce toil for a better work balance. When Google first introduced the SRE role, they set a goal that half of an SRE's time should be focused on reducing future operational work or adding service features. Developing new features correlates with improving metrics like reliability and performance, which ultimately reduces potential toil down the line.
Monitoring
At Dotcom-Monitor, we are all about monitoring solutions for tracking the uptime, availability, functionality, and all-around performance of servers, websites, services, and applications. Monitoring is one of the most important principles within the SRE role. Continuous monitoring ensures that services are performing as intended and can identify issues the moment they arise so they can be fixed immediately. As we mentioned in the previous section, meeting those SLOs is key to the defined business SLAs, and ultimately, to users. Monitoring provides SREs and teams with a historical trend of performance and can offer insight into whether an issue is a one-off or a wider, systemic problem. As defined by the Google SRE initiative, the four golden signals of monitoring are the following metrics:
- Latency. Latency is the amount of time, or delay, a service takes to respond to a request. Clearly, slow response times will affect the perceived user experience. Monitoring can provide a way to differentiate between the latency of successful requests and the latency of failed requests.
- Traffic. Traffic refers to the amount of user demand, or load, placed on the system. This can be measured in HTTP requests per second, or by other throughput metrics, depending on the actual service.
- Errors. Errors refer to the rate at which requests to the service fail. However, it is important for SRE teams to differentiate between hard failures, like HTTP 500 server errors, and soft failures, such as a 200 OK response that nonetheless exceeded a defined performance threshold. It is important to consider how to appropriately monitor scenarios like these.
- Saturation. Saturation measures how much of its available resources a given service is using. Past a certain point, most services will experience performance degradation. Understanding where this occurs helps define correct monitoring objectives and targets, so corrective action can be carried out.
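The four signals above can be sketched as simple aggregations over request records. This is an illustrative example only; the field names (duration_ms, status) and the sample values are assumptions, not output from any particular monitoring tool.

```python
# Hypothetical batch of request records collected over a 2-second window.
requests = [
    {"duration_ms": 120,  "status": 200},
    {"duration_ms": 2450, "status": 200},  # a slow "soft failure"
    {"duration_ms": 95,   "status": 500},  # a hard failure
    {"duration_ms": 130,  "status": 200},
]
WINDOW_SECONDS = 2
LATENCY_SLO_MS = 1000  # assumed per-request performance threshold

# Latency: track successful requests separately from failed ones.
ok_latencies = [r["duration_ms"] for r in requests if r["status"] < 500]

# Traffic: request rate over the observation window.
traffic_rps = len(requests) / WINDOW_SECONDS

# Errors: hard failures (5xx) plus soft failures (2xx but over the threshold).
hard_errors = sum(r["status"] >= 500 for r in requests)
soft_errors = sum(r["status"] < 500 and r["duration_ms"] > LATENCY_SLO_MS
                  for r in requests)
error_rate = (hard_errors + soft_errors) / len(requests)

# Saturation: how "full" the service is, e.g. a CPU utilization gauge
# (a made-up reading here; in practice this comes from the host or platform).
cpu_util = 0.78

print(traffic_rps, error_rate)  # 2.0 0.5
```

Note how counting the slow 200 OK response as an error changes the picture: by status code alone the error rate would be 25%, but against the latency threshold it is 50%.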
Automation
Automate, automate, automate. We touched on this principle earlier when we discussed reducing toil, but it cannot be overstated. The nature of the SRE role is as diverse as a role can be. In order to reduce the potential for manual intervention across all facets of their responsibilities, automating tasks is key. As services scale and become more distributed, they become much harder to manage. Automating repetitive tasks across the board, whether testing, software deployment, incident response, or simply communication between teams, provides immediate benefits, efficiencies, and most importantly, consistency. Since the SRE role was conceived, there has been a shift in how development, QA, and operations teams collaborate, and various platforms and tools have been developed to support these new DevOps environments and practices.
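As a minimal illustration of turning toil into automation: instead of an engineer checking a list of status pages by hand, a script can probe each endpoint and report the failures. The URLs here are placeholders, and a real setup would page or notify rather than print.

```python
from urllib.request import urlopen
from urllib.error import URLError

# Placeholder endpoints; substitute your own health-check URLs.
ENDPOINTS = [
    "https://example.com/healthz",
    "https://example.com/api/status",
]

def check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, or non-2xx all count as down.
        return False

failures = [url for url in ENDPOINTS if not check(url)]
if failures:
    # In practice, fire an alert here instead of printing.
    print("Unhealthy endpoints:", failures)
```

Run on a schedule (cron, a CI job, or a monitoring platform), a few lines like this replace a recurring manual check entirely, which is exactly the kind of toil reduction this principle describes.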
Release Engineering
Release engineering sounds like a complex subject, but in reality, it is simply a way to define how software is built and delivered. While release engineering is its own title and role, within the context of SRE it means delivering services that are stable, consistent, and of course, repeatable. This goes back to the previous section about automation. If you are going to do something, do it right AND be able to repeat it, in a consistent manner, as necessary. Building a bunch of one-off services is time-consuming and creates unneeded toil.
If we go back to the history of the SRE position at Google, they had dedicated release engineers who worked directly with SREs. Release engineers are typically tasked with defining best practices for developing software services, deploying updates, continuous testing, and addressing software issues, among many other responsibilities. The role becomes even more critical when you consider how to scale services and deploy them quickly. Having a set of best practices and tools (and enforcing them) is essential to meeting these demands and gives SRE teams peace of mind once a build is put into production.
Simplicity
For a position with seemingly no end to its responsibilities and expectations, the last principle, ironically, is simplicity. Perhaps easier said than done in practice, this principle focuses on developing a system or service that is only as complex as necessary. While that may seem counterintuitive at first, it really boils down to wanting a system that is reliable, consistent, and predictable. That may sound boring, but to an SRE, that is one of the ultimate end goals.
SREs strive for a system or service that is not complex or difficult to manage, one that simply does the job it was designed to do. From a user's perspective, a service with a lot of features may provide a lot of benefits, but to an SRE, that just means more potential headaches. However, change is inevitable. If you want to add new features to a web service, do so thoughtfully. Smaller, incremental changes are easier (and simpler) to manage than building out and shipping many features at one time. SREs must also consider the needs and goals of the business.
SRE Principles: The 7 Fundamental Rules – Final Thoughts
The SRE role focuses on building, delivering, and maintaining reliable systems and services at scale. These seven core principles help define practices for SREs that drive alignment with DevOps and support the goals of the business. It is a complex role that seeks to balance reliability with feature releases, all while maintaining exceptional levels of quality.
The Dotcom-Monitor platform provides SREs with all the monitoring features they need to ensure continuity of their services. From configurable alerts to real-time dashboards and reports, the platform provides the essential tools required to manage the performance of all their services for the long term. For example, create web application scripts based on user behavior, actions, and paths, and set up synthetic monitoring tasks to ensure a consistent experience over time. No matter the level of monitoring your team requires, there is a solution to meet your needs.