In a recent report by IDC titled “DevOps and the Cost of Downtime: Fortune 1000 Best Practice Metrics Quantified,” the cost of downtime was explored across Fortune 1000 organizations. The numbers they arrived at may surprise you. The average cost of application downtime among Fortune 1000 companies was estimated at somewhere between $500,000 and $1 million per hour! The total annual cost of downtime was estimated at between $1.25 billion and $2.5 billion.
Direct Sales, Consumer Trust, Brand Management, and Loss of Productivity
While these estimates are probably on the high end for small to medium businesses, they still point to the fact that downtime has a real cost for any business. Whether it is the loss of direct sales, the loss of consumer trust due to a poor user experience, negative brand exposure, or lost productivity, downtime leads to losses. The trouble most businesses have is quantifying what will be lost in some unknown future incident. Unfortunately, most organizations do not place an emphasis on minimizing downtime until they have been hurt by an outage, and even then, as the incident recedes into the past, the sense of real loss tends to fade and the cost of mitigating future incidents starts to look unnecessary. So what is the cost of downtime, and how do we quantify it?
The True Cost of Downtime
Downtime not only affects direct sales in the case of a consumer-facing shopping cart; it can also have an underlying psychological effect on external users, such as customers and new prospects, and on internal users, including employees and stockholders. It makes sense that an online shopping cart averaging $10,000 in sales per hour will tend to lose $10,000 for every hour the website is down. But how do you account for the repeat customers who happened to visit the site during the outage, decided that this particular vendor is no longer trustworthy, and never returned for future purchases?
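As a rough illustration of the direct portion of that loss, the math is simple multiplication. The figures below are hypothetical, and of course this captures none of the indirect damage described above:

```python
# Rough, illustrative estimate of direct revenue lost to downtime.
# All figures are hypothetical examples, not industry benchmarks.

def direct_downtime_cost(hourly_revenue: float, outage_hours: float) -> float:
    """Direct sales lost while the site is unavailable."""
    return hourly_revenue * outage_hours

# A shop averaging $10,000/hour in sales, down for 3 hours:
loss = direct_downtime_cost(10_000, 3)
print(f"Estimated direct loss: ${loss:,.0f}")  # Estimated direct loss: $30,000
```

The hard part, as the questions below show, is everything this simple formula leaves out.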
What happens when a large investment firm sees outages as a sign of instability and divests of the company’s stock?
What about the employees using an internal application or a SaaS app that frequently crashes or experiences outages?
The quick and dirty answer to all of these questions is that there is no way to know the exact value of these lost opportunities, or of the negative brand perceptions built during such outages. Studying the big players who have been through similar negative publicity events tells us there must be quite a bit of anticipated damage, given the extent of damage control such retailers are willing to undertake. After the 2013 Target credit card breach, for example, Target responded with store-wide discounts as well as one year of free credit reports for its customers. A blog post at Mailchimp on handling website outages identifies several recent outages of large websites and stresses the importance of properly communicating both the measures taken to mitigate the outage and the future actions that will ensure it does not happen again. As Joel Spolsky put it in one of his blog entries, “It’s the unexpected unexpecteds, not the expected unexpecteds, that kill you.” So the next question is: how do we deal with downtime?
Combating Optimism Bias
It is important to realize that our fundamental view of our place in the world tends to be biased. It is part of human nature to hold a self-centered view of events: we tend to believe that we are above average, and that bad events will generally happen to someone else and never to us. This optimism bias leads us to believe that we don’t need to worry about or plan for worst-case scenarios, or even for small inconveniences such as an application or website outage. Furthermore, the more time that passes between such events, the stronger the illusion of control we build – the belief that our own actions are what keep things running smoothly. This illusion of control makes us even less likely to put mitigation strategies in place for when such events do occur.
The Cost of Murphy’s Law
Seasoned veterans who have been in the IT trenches for a while have inevitably run into Murphy’s Law: anything that can go wrong eventually will go wrong. The wise among us have taken steps in the right direction, remaining vigilantly on the lookout for outages and keeping plans in place to troubleshoot, eliminate, and proactively prepare for future events. One difficulty such industry veterans may experience is getting buy-in from management and other departments once they see the costs involved in maintaining an active set of disaster recovery solutions. This is where reports such as IDC’s survey of Fortune 1000 organizations come in, helping decision makers understand the value of services like Dotcom-Monitor’s website, application, and server performance monitoring tools.
The Benefits of Being Proactive
Most people intuitively understand on some level that it is better to be proactive than to wait until disaster strikes, but let’s face it – few people are proactive out of the gate unless they have already been burned. Maybe not on this project, maybe not at this job, but everyone has had a trial by fire that leads them to be “proactive” on their subsequent set of tasks. Because of this reality, it is important for IT professionals who have been burned and now wish to be proactive to have concrete examples to share with stakeholders still living in the illusion of control.
Setting up active monitoring gives you the peace of mind that as long as you are not receiving alerts, everything is working properly – and when you do receive alerts, you receive them in real time, right as the issue is identified. This proactive approach can also lead to quick resolutions, hopefully before anyone else – particularly your customers – has even noticed there was a problem. Even if you do not immediately resolve the issue, proactive monitoring records a history of what happened, so you can later go back, analyze the event, and find ways to ensure that type of issue does not occur again.
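At its core, an active check is just a periodic request with a timer around it. The following is a minimal sketch in Python using only the standard library – the target URL is a placeholder, and a real monitoring service would add scheduling, alerting channels, and history storage on top of this:

```python
# Minimal sketch of a single active uptime check. The target URL is a
# placeholder assumption; a real service checks many URLs on a schedule.
import time
import urllib.error
import urllib.request

def check_site(url: str, timeout: float = 5.0) -> dict:
    """Fetch the URL once and record HTTP status and response time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code   # server answered, but with an error status
    except OSError:
        status = None       # connection failed or timed out
    elapsed = time.monotonic() - start
    return {"url": url, "status": status, "seconds": round(elapsed, 3)}

result = check_site("https://example.com")  # placeholder target
if result["status"] != 200:
    print(f"ALERT: {result['url']} returned {result['status']}")
else:
    print(f"OK: {result['url']} in {result['seconds']}s")
```

Appending each result to a log or database is what turns these one-off checks into the historical record described above.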
Additional Benefits of Synthetic Monitoring
Once you have synthetic monitoring set up for your websites, servers, web apps, and so on, there are quite a few additional benefits beyond simply getting alerts when something goes down. You can record historical trends to find out how the ebb and flow of traffic affects your server responsiveness. As your website grows or your app evolves, you can see how response time is affected by larger and larger amounts of data. You can correlate hardware performance under real user loads with the page load speeds of your site, and use this information to plan ahead for future system upgrades.
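The trend analysis can be as simple as comparing recent response times against an earlier baseline. This toy sketch uses made-up sample data to flag a gradual slowdown before it becomes an outage:

```python
# Toy trend check on recorded response times (seconds).
# The sample history and the 1.5x threshold are illustrative assumptions.
from statistics import mean

samples = [0.31, 0.29, 0.33, 0.35, 0.52, 0.61, 0.74]  # hypothetical history

baseline = mean(samples[:4])   # earlier, healthy period
recent = mean(samples[-3:])    # most recent checks

if recent > baseline * 1.5:
    print(f"Response time trending up: {baseline:.2f}s -> {recent:.2f}s")
```

A degradation like this, caught early, is exactly the kind of signal that justifies a capacity upgrade before customers feel the slowdown.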
In other words, hedge against the cost of downtime by implementing website monitoring to proactively prepare for the worst. A website monitoring solution helps you make incremental improvements to your offerings, which indirectly improves customer loyalty, relieves customer frustration with slow responses, and helps identify trending issues so you can head off critical problems before they happen. So, if you do not currently have monitoring set up, sign up for a free trial to see how it works for yourself.