A site reliability engineer, or SRE, fills a role that encompasses aspects of both software engineering and operations/infrastructure. Site reliability engineering is also a strategy and a set of practices and principles applied across service offerings, and it is closely tied to DevOps and operations. The term site reliability engineering originated at Google in 2003, when a site reliability team was created. At that time, the team was made up of software engineers. Since then, the concept has evolved and made its way into the broader software development industry, where it is now a role in its own right.
Site reliability engineers bridge the gap between operations and software developers. While there is no one-size-fits-all description of what a site reliability engineer does from organization to organization, broadly speaking, the role’s responsibilities can span a wide range of objectives, such as managing and monitoring the availability, latency, performance, and efficiency of an organization’s services, as well as incident response and capacity planning. Let us dig deeper to understand more about this role and how it functions within organizations.
What is Site Reliability Engineering?
To think about it another way, site reliability engineering is where the traditional IT role, or system administration role, and DevOps meet. In a traditional IT environment, organizations may have had a team of system administrators managing complex systems. Their focus and responsibility was to ensure that software was deployed properly and to deliver a reliable service to end users. Their role also included managing any issues that occurred after software deployment.
However, system administrators are not focused on the actual software development, which is where development and system administrator roles can be at odds. Developers are focused on producing software and getting it into the hands of users, and are not necessarily concerned with the effects of software deployment. It is at this junction that the site reliability engineer role comes in. Site reliability engineers are focused on creating scalable and reliable software systems. This includes ensuring that development work is efficient and reliable, so that when the finished product is ready for production, there are no surprises.
What Does a Site Reliability Engineer Do?
Site reliability engineering involves splitting time between operations and development. On the operations side, a site reliability engineer may handle help desk tickets, on-call incidents, and manual tasks. On the development side, they may spend their time on proactive projects, such as automation and improving system reliability, with the aim of reducing the amount of manual work and ensuring that all the components required to keep software deployments live (infrastructure/hardware, middleware, software, etc.) are running efficiently.
What are Some Common SRE Responsibilities?
Actual SRE responsibilities vary from company to company, but for the most part, an SRE or SRE team is responsible for all aspects of their service offerings. The role may involve one, all, or more than the responsibilities listed below:
- Capacity Planning
- Incident Response
- On-call Support
So, as you can see, an SRE role tends to be a jack of all trades. One minute an SRE might be provisioning storage in AWS, the next minute an SRE might have to talk to customers or go write some Python code for a new project. It really depends on the day.
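As a small illustration of the kind of Python an SRE might write for capacity planning, here is a minimal sketch that estimates how long a storage volume will last. It assumes linear growth, which real workloads often do not follow, and the function names, parameters, and the 30-day lead time are all hypothetical choices made for this example.

```python
import math


def days_until_full(capacity_gb: float, used_gb: float, daily_growth_gb: float) -> float:
    """Estimate the days left before a volume fills up, assuming linear growth."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    if used_gb >= capacity_gb:
        return 0.0  # already full
    if daily_growth_gb <= 0:
        return math.inf  # usage is flat or shrinking, so it never fills
    return (capacity_gb - used_gb) / daily_growth_gb


def needs_expansion(capacity_gb: float, used_gb: float,
                    daily_growth_gb: float, lead_time_days: float = 30.0) -> bool:
    """Flag a volume when it would fill within the provisioning lead time."""
    return days_until_full(capacity_gb, used_gb, daily_growth_gb) <= lead_time_days


if __name__ == "__main__":
    # Hypothetical volume: 1000 GB capacity, 700 GB used, growing 10 GB per day.
    print(days_until_full(1000, 700, 10))   # 30.0
    print(needs_expansion(1000, 700, 10))   # True
```

In practice the growth figure would come from monitoring data rather than a hard-coded number, and the lead time would reflect how long provisioning actually takes in a given organization.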
What Tools do SREs Use?
The tools and software solutions that site reliability engineers use can vary greatly from organization to organization. One of the main reasons is that larger organizations typically have more personnel within an SRE team, so responsibilities and scope are divided amongst the team, resulting in a more focused role for each SRE. In turn, this reduces the range of tools and platforms each person uses. For example, in a larger enterprise organization, an SRE may just work in Jenkins all day, every day.
On the flip side, a site reliability engineering team or individual in a smaller organization may have to wear many more hats, as personnel would likely be limited. Their toolset would therefore have to include everything from configuration management platforms and automated incident response systems to monitoring and analytics tools. You may already be familiar with some of the tools that an SRE uses, such as Docker, Terraform, Prometheus, and Kibana.
Read: Top 13 Site Reliability Engineer (SRE) Tools to learn more about the most popular tools that site reliability engineers use today.
Where Can I Learn More about Site Reliability Engineering?
The term “Site Reliability Engineer” is attributed to Ben Treynor Sloss, now a Vice President of Engineering at Google. He was asked in 2003 to create and manage a team of seven engineers, which eventually led him to create the new role/title. There are a few great online resources written by Ben and several other Google engineering team members that cover everything from the principles and tenets of SRE and SRE roles and responsibilities to the evolution of the Site Reliability Engineering role and where it stands in today’s DevOps environments. No better way to learn more about site reliability engineering than from the individual and organization that created the role in the first place, right?
There is also a great list of Site Reliability Engineering resources located on GitHub.
Conclusion: What is a Site Reliability Engineer (SRE)?
As we have covered, an SRE is more than just your traditional operations or system administrator role. An SRE uses their breadth of experience and knowledge to help automate and create efficiencies across their software services and organization. A good SRE is, by and large, an excellent problem solver. They do not necessarily have to be an expert in everything they do, but they must have a grasp of many different disciplines and know what steps and techniques to carry out when issues arise. They also have to understand how different roles within their organization work together in order to carry out tasks and projects effectively. It is like constantly putting together a large, complicated puzzle. It can be frustrating and demanding, and pieces can sometimes go missing, but once you have finished it, there is a great deal of pride and accomplishment.
Monitoring and observability are a key part of an SRE’s responsibilities. The synthetic monitoring solutions from Dotcom-Monitor allow SREs and DevOps teams to simulate and monitor users as they move through a system or service. The Dotcom-Monitor platform allows SREs to set up customized monitoring alerts and integrates with incident and alerting platforms such as PagerDuty, VictorOps, and AlertOps, as well as many others. Furthermore, SREs can view real-time dashboards, access reports, and review analytics to quickly identify performance issues. It is vital for SREs and their teams to continually monitor the health of applications and infrastructure so that they understand the reliability, accessibility, and overall performance of their systems.
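To make the idea of a synthetic check concrete, here is a minimal, generic sketch in Python. It is not how Dotcom-Monitor or any other product works internally. It only illustrates the basic pattern of issuing a scripted request and judging the result against a status and latency threshold. The names `run_check`, `fetch_status`, and `CheckResult`, and the 2-second threshold, are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    status: int        # HTTP status code, or 0 if the request never completed
    latency_s: float   # wall-clock time taken by the request
    reason: str


def run_check(fetch: Callable[[], int], max_latency_s: float = 2.0,
              clock: Callable[[], float] = time.monotonic) -> CheckResult:
    """Run one synthetic check: call `fetch`, then judge status and latency."""
    start = clock()
    try:
        status = fetch()
    except Exception as exc:  # network errors, timeouts, DNS failures, etc.
        return CheckResult(False, 0, clock() - start, f"request failed: {exc}")
    latency = clock() - start
    if not 200 <= status < 300:
        return CheckResult(False, status, latency, f"unexpected status {status}")
    if latency > max_latency_s:
        return CheckResult(False, status, latency, "too slow")
    return CheckResult(True, status, latency, "ok")


def fetch_status(url: str, timeout_s: float = 5.0) -> int:
    """Fetch a URL with the standard library and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
```

A check against a hypothetical health endpoint could then be written as `run_check(lambda: fetch_status("https://example.com/health"))`. A failing result would be handed to whatever alerting platform the team uses. Passing the fetch function in as a parameter keeps the pass/fail logic easy to test without touching the network.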
Learn more about Dotcom-Monitor and how you can use the platform to go deeper into monitoring and observability and gain better insight into your applications and infrastructure.