Monitoring Distributed Systems

There was a time when standing up a website or application was simple and straightforward, nothing like the complex networks of today. Web developers and administrators did not have to worry about, or even consider, the complexity of today’s distributed systems. The recipe was straightforward. Do you have a database? Check. Do you have a web server? Check. Great, your system was ready to be deployed. Once the system was deployed, it only took a couple of simple checks to verify that everything was running smoothly. Was the database running? Yes. Was the web server running? Perfect. The last item to check was whether the web server was able to talk to the database. Awesome, everything is running as expected. On to the next project.

Ok, so maybe that is an oversimplified example, but you get the point. Setting up and monitoring these systems was pretty easy compared to today’s standards. There were no dynamic web applications or complex user scenarios to monitor. For basic and simple websites, a developer was able to easily automate these checks and fix any problems before a user encountered them. While there are now more options when it comes to creating a website, the complexity of the environment has expanded along with them. Gone are the days of monolithic architecture. With the rapid advancements in web application technologies, programming languages, cloud computing services, microservices, hybrid environments, etc., monitoring distributed systems has become much more difficult to carry out and manage. So much so that it has become a full-time, dedicated job role within organizations.

 

What is a Distributed System?

By definition, a distributed system is any system that comprises multiple components on a variety of machines that work together to appear as a single, organized system. Although the definition may seem straightforward, in the real world, a distributed system is one of the most complex environments to understand, manage, and monitor. This complexity is “hidden” from the end user, whether that is an actual user or another computer, much like how an API (Application Programming Interface) hides its inner workings from its callers.

These systems can include physical servers, containers, virtual machines, or even a device, or node, that connects and communicates with the network. Because of this, even the smallest components can cause far-reaching issues that are hard to pinpoint and troubleshoot. Within an organization, the responsibility of monitoring these large distributed systems typically falls on site reliability engineering (SRE) teams.

 

Types of Distributed Systems

When we think of a system’s architecture, the first thing that may pop into your mind is the traditional client-server system, where a server is the shared resource among many different devices and machines, like printers, computers, clients, etc. Over time, that has evolved into something different. Today, there are a variety of architectures and systems in use. For example, you can think of a cell phone network as a type of distributed system, consisting of internet-connected devices that share resources and workload.

 

Peer-to-Peer

In this type of network, workloads are distributed across hundreds or thousands of different machines. Blockchain is a good example of this. There is no single server or machine that takes care of the workload.

 

Three-Tier

A three-tier system is a software application architecture that consists of a presentation layer, application layer, and data, or core, layer. The benefit of this system is that each tier runs independently from the other tiers, so they can be updated or scaled without impacting the other tiers.
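To make the separation of tiers a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the class names, the product data, and the in-memory dictionary standing in for a database are all made up for illustration. In a real three-tier system each tier would usually run as its own process or on its own machine. Here they share one process so the example stays short.

```python
# Data tier: owns storage. An in-memory dict stands in for a real database.
class ProductStore:
    def __init__(self) -> None:
        self._rows = {1: {"name": "widget", "price_cents": 1250}}

    def get(self, product_id: int):
        return self._rows.get(product_id)


# Application tier: business rules. It only talks to the data tier.
class ProductService:
    def __init__(self, store: ProductStore) -> None:
        self._store = store

    def price_label(self, product_id: int) -> str:
        row = self._store.get(product_id)
        if row is None:
            raise KeyError(product_id)
        return f"{row['name']}: ${row['price_cents'] / 100:.2f}"


# Presentation tier: formats output. It only talks to the application tier.
def render_product_page(service: ProductService, product_id: int) -> str:
    return f"<h1>{service.price_label(product_id)}</h1>"


page = render_product_page(ProductService(ProductStore()), 1)
print(page)
```

Because each tier only depends on the one directly beneath it, the dictionary could be swapped for a real database without touching the presentation code.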

 

Multi-Tier

Multi-tier, also known as N-tier, is any application architecture with more than a single tier. Architectures with many tiers are less common, however, because the more layers that are used, the more complex and difficult the system is to manage. And who really wants more complexity? For that reason, the three-tier system is still popular today, and when multi-tier or N-tier systems are discussed, the term typically refers to three-tier systems.

 

Key Characteristics of a Distributed System

 

No Shared Clock

Each node maintains its own local clock, or timer, and time values may differ between nodes. Instead of relying on those local clocks to agree, nodes use a logical clock to establish and maintain the order of events.
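One well-known kind of logical clock is the Lamport clock, which is simply a counter that every node carries and attaches to the messages it sends. The sketch below is a minimal illustration, not a production implementation. The class name, the method names, and the two example nodes are our own assumptions.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """A logical clock: a counter that orders events without wall-clock time."""

    time: int = 0

    def tick(self) -> int:
        """Advance the counter for a local event or just before sending a message."""
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        """Merge the timestamp carried by an incoming message."""
        self.time = max(self.time, sender_time) + 1
        return self.time


# Two hypothetical nodes whose counters start out of step.
node_a = LamportClock(time=5)
node_b = LamportClock(time=1)

sent_at = node_a.tick()                # node A stamps its outgoing message
received_at = node_b.receive(sent_at)  # node B moves past that stamp

print(sent_at, received_at)
```

The receiving node always ends up with a timestamp larger than the one on the message, so a send is guaranteed to be ordered before its matching receive, even though the two nodes never compared real clocks.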

 

No Shared Memory

Each process has its own independent memory that it works with. State is distributed through the system.

 

Concurrency

Software and hardware components are autonomous and execute tasks concurrently. Concurrency refers to the system’s ability to carry out multiple tasks in parallel and manage the access and usage of shared resources.
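As a small illustration of managing access to a shared resource, the sketch below runs four worker threads that all update the same counter. The function name and the numbers are made up for this example, and a single Python process is only standing in for what would be many machines in a real distributed system.

```python
import threading

counter = 0                      # the shared resource
counter_lock = threading.Lock()  # guards every update to the counter


def record_requests(n: int) -> None:
    """Simulate a worker that records n handled requests."""
    global counter
    for _ in range(n):
        # Only one thread at a time may perform the read-modify-write.
        with counter_lock:
            counter += 1


workers = [threading.Thread(target=record_requests, args=(10_000,)) for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(counter)
```

The lock ensures that no update is lost when several workers touch the counter at the same moment, which is the kind of coordination concurrency demands.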

 

Heterogeneity

A distributed system comprises a variety of hardware and software components with different operating systems and technologies, meaning the processors are separate and independent of each other. For everything to work together in harmony, middleware is used to act as the conduit between these different hardware and software components. There are also several types of middleware, including database middleware, application server middleware, message-oriented middleware (MOM), web middleware, transaction processing (TP) middleware, and many more.

 

Benefits of a Distributed System

Distributed systems are ideal for large-scale projects where resources can be shared across the network, helping to create and build a more reliable system. All these benefits provide organizations with a more flexible and robust system.

 

Scalability

In a distributed system, new devices or machines can be added (or removed) inexpensively by taking advantage of the computing power of individual nodes. This allows organizations to scale out horizontally to provide better functionality and performance.

 

Reliability

Another key benefit of distributed systems is reliability. The interconnections between nodes and the rest of the system make it possible to communicate and share data efficiently.

 

Fault Tolerance

Fault tolerance is the ability of a system to respond gracefully to a failure within its software or hardware components so that services continue operating properly. As we mentioned earlier, resource sharing is a key component of a distributed system, and it is what makes fault tolerance possible.
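One simple way to picture fault tolerance is failover: if one node cannot answer, the request is retried against another. The sketch below assumes a hypothetical list of replicas, each represented by a plain Python function. The names `call_with_failover`, `unhealthy_replica`, and `healthy_replica` are invented for illustration and do not come from any particular library.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful response."""
    failures = 0
    for replica in replicas:
        try:
            return replica()
        except ConnectionError:
            failures += 1  # this replica is down, move on to the next one
    raise AllReplicasFailed(f"{failures} replica(s) failed")


# Hypothetical replicas: the first is unreachable, the second is healthy.
def unhealthy_replica() -> str:
    raise ConnectionError("node unreachable")


def healthy_replica() -> str:
    return "ok from replica 2"


result = call_with_failover([unhealthy_replica, healthy_replica])
print(result)
```

From the caller's point of view the request simply succeeded. The failed node still needs attention, which is why this kind of silent recovery should be paired with monitoring.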

 

Disadvantages of Distributed Systems

We all know that there is no such thing as a perfect system and there is a tradeoff for every decision made, especially in cases where complex, modern technologies are involved. Here are some of the disadvantages of distributed systems.

 

Complexity

The more complex the environment, the more difficult it becomes to manage. More nodes and devices mean more data being processed and passed along, which creates a higher chance that something is going to go wrong. For example, this can lead to data synchronization issues and node failures. This is a crucial reason organizations need to invest in a monitoring solution. We will dive more into monitoring distributed systems in the following sections. This also gets into the SRE principle of embracing risk and learning how to manage failures.

Read: SRE Principles: The 7 Fundamental Rules

 

Latency

Reiterating what we mentioned in the above section, the more complex and distributed the system becomes, the more room there is for error. This also includes latency, or the time it takes for data or a request to get through a network. Latency is also one of the four golden signals of monitoring, alongside traffic, errors, and saturation.
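To show how some of these signals can be derived from raw request data, here is a rough sketch. The sample requests, the 10-second window, and the helper names are all made up for illustration. Saturation is left out because it depends on resource limits that a request log alone does not capture.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Made-up requests observed during a hypothetical 10-second window.
WINDOW_SECONDS = 10
requests = [
    Request(40, 200),
    Request(55, 200),
    Request(60, 200),
    Request(75, 200),
    Request(900, 500),
]

latencies = [r.latency_ms for r in requests]
p50 = percentile(latencies, 50)                # latency: the typical request
p99 = percentile(latencies, 99)                # latency: the slowest requests
traffic_rps = len(requests) / WINDOW_SECONDS   # traffic: requests per second
error_rate = sum(r.status >= 500 for r in requests) / len(requests)  # errors

print(p50, p99, traffic_rps, error_rate)
```

Notice how far apart the median and the 99th percentile are in this sample. An average would hide the one very slow request, which is why latency is usually tracked as percentiles.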

 

Security

Managing a small, centralized system can be considered a walk in the park compared to distributed systems. Security is of utmost importance in all aspects of the system, from the network itself all the way through the devices and nodes that connect into it. Organizations must ensure that data is secure, especially with the rise of the Internet of Things landscape, where any device could become a potential security risk.

 

Resources

Big systems cost big money. There is no way around it. There is the initial investment in hardware and software, plus the general maintenance and overhead to support the continuous development and deployment of applications and services.

 

Monitoring a Distributed System

The slow shift from monolithic systems to distributed systems has changed the way organizations and teams think about monitoring their infrastructure, websites, applications, APIs, etc. With the focus no longer on one single giant system, the traditional methods of monitoring have needed to evolve to meet the needs of modern organizations. While modern DevOps and Agile practices try to ensure that no bugs are present when applications and services move into production, there is still a chance that performance issues will eventually rear their ugly heads. Not only that, the focus on the user experience is paramount, especially in today’s mobile-first landscape. Teams must ensure that they are monitoring performance from the user’s perspective as well as from the system itself.

For SREs, the definition of monitoring can mean a lot of different things. However, there are a couple of distinct types: white-box monitoring and black-box monitoring.

 

White-box Monitoring

White-box monitoring concerns itself with understanding how your applications run on the server. The metrics measured could include HTTP (Hypertext Transfer Protocol) requests, response codes, user metrics, etc. Think of white-box monitoring as a window into the internal system. White-box monitoring is used to understand or predict why something may fail.
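As a rough sketch of the white-box idea, the hypothetical request handler below records its own response codes and timings from inside the application. The handler, its routing rule, and the metric names are invented for this example. A real service would normally hand these numbers to a metrics library rather than keep them in plain Python objects.

```python
import time
from collections import Counter

# Internal metrics that the application itself exposes.
status_counts = Counter()
latencies_ms = []


def handle_request(path: str) -> int:
    """Hypothetical handler that instruments itself while serving a request."""
    started = time.perf_counter()
    status = 200 if path == "/" else 404  # stand-in for real request handling
    elapsed_ms = (time.perf_counter() - started) * 1000

    status_counts[status] += 1
    latencies_ms.append(elapsed_ms)
    return status


for path in ["/", "/", "/missing"]:
    handle_request(path)

print(dict(status_counts))
```

Because the counts come from inside the application, they can reveal a rising number of 404 responses before any user reports a problem.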

 

Black-box Monitoring

On the flip side, black-box monitoring is focused on server metrics like disk space, CPU, memory, load, etc., which are typically thought of as the core monitoring metrics, and on understanding performance from the end user’s perspective. Black-box monitoring is used to detect that something within the system is not working correctly, as seen from the outside.
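A black-box check from the end user’s perspective can be sketched as a probe that calls the system from the outside and only looks at what comes back. In the example below, the `probe` function, the `ProbeResult` type, the 500 ms threshold, and the two fake sites are all assumptions made for illustration. The fakes replace a real network call so the example can run anywhere.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool
    latency_ms: float
    detail: str


def probe(fetch: Callable[[], int], max_latency_ms: float = 500.0) -> ProbeResult:
    """Call the system from the outside and judge only what a user would see."""
    started = time.perf_counter()
    try:
        status = fetch()
    except Exception as exc:
        latency_ms = (time.perf_counter() - started) * 1000
        return ProbeResult(False, latency_ms, f"request failed: {exc}")
    latency_ms = (time.perf_counter() - started) * 1000
    ok = status == 200 and latency_ms <= max_latency_ms
    return ProbeResult(ok, latency_ms, f"status {status}")


# Hypothetical stand-ins for a real HTTP request to the monitored site.
def fake_healthy_site() -> int:
    return 200


def fake_broken_site() -> int:
    raise TimeoutError("no response")


healthy = probe(fake_healthy_site)
broken = probe(fake_broken_site)
print(healthy.ok, broken.ok, broken.detail)
```

In practice, `fetch` would wrap a real request, for example one made with `urllib.request.urlopen`, and the probe would run on a schedule from outside the network being monitored.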

 

The Best of Both Worlds

Even though there may be two distinct types of monitoring that help define the responsibilities of an SRE, rarely is just one type of monitoring used by itself. Typically, a combination of both types is used. Depending on how critical the application or service is, white-box monitoring may be used to head off potential issues. Black-box monitoring may be used in cases where an SRE or team needs to be alerted immediately to issues that impact users.

 

Conclusion: Monitoring Distributed Systems

Dotcom-Monitor provides multiple solutions that meet the unique needs of site reliability engineers and DevOps teams to monitor end-to-end performance of websites, applications, APIs, services, and infrastructure. Along with features like customizable alerting options, performance dashboards, comprehensive reports, and analytics, the Dotcom-Monitor platform allows SRE and performance monitoring teams to quickly identify availability, uptime, and performance issues at scale. Setting up proactive, synthetic monitoring tasks is critical for complex, distributed systems, especially where the end user experience is concerned.

The Dotcom-Monitor platform can help teams quickly and efficiently pinpoint the causes of performance issues, whether at the infrastructure or end user level. Real-time dashboards, analytics, and log data provide a continuous stream of monitoring metrics so you can be sure your systems, applications, sites, and services are performing as intended. Alerts can be customized to meet the requirements of your team and can integrate with the communication and collaboration tools you already use.

Get started with the Dotcom-Monitor platform today with the free trial! Or if you prefer a one-on-one walk-through of the platform and individual solutions, contact our team for a live demo.

 
