In general, an Uptime value reflects the percentage of time, measured within a specified period, during which Dotcom-Monitor receives successful responses from monitoring agents (“Agents” hereinafter). The Downtime value reflects the percentage of time, measured within a specified period, during which Dotcom-Monitor receives negative responses. Dotcom-Monitor’s worldwide network of Agents interprets these responses using a specific method. However, this basic description of Uptime and Downtime values doesn’t account for the Uptime calculation realities that many organizations face. When organizations consider their processes and goals, it becomes clear that choices need to be made, because the definitions of the Uptime value and the Downtime value are interdependent: the two values influence each other.

With that in mind, below are several examples that raise the question “How do you define Downtime?”:

  • If you have scheduled maintenance on your web server every Sunday evening, is your website down?
  • If your Chicago-based web server cannot be reached from Orlando, FL (but is available from the rest of the USA) because the backbone provider Time Warner is having an issue in Orlando, is your website down?
  • If a third-party hosted element (say, a chat widget) is experiencing a server error, but the rest of your website is available, is your website down?
  • If your website is not available from anywhere in the world due to a server hiccup that lasts 5 seconds, is your website down?
  • If your retail website’s shopping cart is working, but your About Us page is not, is your website down?
  • If one DNS server is down but three others are working, so that 25% of clients cannot reach the website after the cached time-to-live (TTL) expires, is it a down condition?
  • If one of three web servers in a web farm goes down and the page response time increases by 10%, 25%, or 50% (slower page load time), at what point is this downtime?

If answering “Down” meant waking up at 2 am to address the issue, would any of your answers change?

Uptime/Downtime calculation approach

Dotcom-Monitor’s approach lets you define carefully how responses are interpreted as either “Up” or “Down”. This is accomplished using filters.

Incidentally, a filter can be applied both to a device (reducing false triggering) and to any type of report.

Filtering defines the Up/Down states using the following adjustable criteria:

  • Error is reported for a specified number of minutes
  • Error is confirmed by a specified number of agents
  • Error is detected in a specified number of tasks.
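The three criteria above can be pictured as thresholds that must all be satisfied before a device counts as “Down”. The following is only a minimal sketch under that assumption; the class `Filter`, its field names, and the function `is_down` are hypothetical illustrations and are not part of the Dotcom-Monitor product or API.

```python
from dataclasses import dataclass


@dataclass
class Filter:
    # Hypothetical field names, chosen for this illustration only.
    min_minutes: int  # error must be reported for at least this many minutes
    min_agents: int   # error must be confirmed by at least this many agents
    min_tasks: int    # error must be detected in at least this many tasks


def is_down(f: Filter, error_minutes: int, failing_agents: int, failing_tasks: int) -> bool:
    """Return True only when every threshold of the filter is met (assumed AND logic)."""
    return (
        error_minutes >= f.min_minutes
        and failing_agents >= f.min_agents
        and failing_tasks >= f.min_tasks
    )


example_filter = Filter(min_minutes=5, min_agents=3, min_tasks=1)
print(is_down(example_filter, error_minutes=6, failing_agents=3, failing_tasks=1))
print(is_down(example_filter, error_minutes=6, failing_agents=2, failing_tasks=1))
```

In this sketch the second call is not “Down” because only two agents confirm the error while the filter asks for three.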

All filters and their settings are available at Configure > Filters. After a filter is applied to a device, all of the device’s notifications are based on the filter’s criteria. The “Default Filter” is assigned to all new devices. It has a balanced configuration and is suitable for most monitoring devices.

Uptime/Downtime Calculations

The formula for the Downtime calculation is as follows:

1. Downtime Duration is tied directly to the configurations within the filter:

  • The Downtime period starts when a filter’s conditions are met. For example, when the number of agents reporting a failure reaches the number specified in the filter, and the filter’s conditions for the number of minutes and tasks are also met, the Downtime period begins and a downtime alert is sent.
  • The Uptime period starts when the filter’s conditions are no longer met, that is, when the reported errors no longer satisfy the filter’s thresholds for agents, minutes, or tasks. For example, an “up” state is indicated when the number of agents reporting an error (“down”) response drops below the number the filter requires to indicate a “down” condition.
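The start and end of a Downtime period can be illustrated with a small sketch that looks only at the agent-count criterion (the minutes and tasks criteria are left out for brevity). The function `down_intervals`, its input format, and the sample data are assumptions made for this illustration, not Dotcom-Monitor’s actual implementation.

```python
def down_intervals(samples, min_agents):
    """Derive Downtime intervals from (time, failing_agent_count) samples.

    samples: list of (time, failing_agent_count) tuples sorted by time.
    min_agents: number of failing agents the filter requires for "Down".
    Returns a list of (start, end) tuples; end is None if still down.
    """
    intervals = []
    down_since = None
    for t, failing in samples:
        if failing >= min_agents and down_since is None:
            down_since = t                     # filter conditions met: Downtime starts
        elif failing < min_agents and down_since is not None:
            intervals.append((down_since, t))  # conditions no longer met: Uptime starts
            down_since = None
    if down_since is not None:
        intervals.append((down_since, None))   # still down at the end of the data
    return intervals


# Hypothetical timeline: time in minutes, number of agents reporting an error.
timeline = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 2), (6, 0)]
print(down_intervals(timeline, min_agents=3))  # [(3, 5)]
```

Here the device goes down at minute 3, when the third agent confirms the error, and returns to the Up state at minute 5, when fewer than three agents report it.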

2. Duration of “Undefined” state. The Undefined state is set when the status of every agent involved in monitoring becomes Undefined. An agent’s status is considered Undefined if the agent does NOT provide any response (error or success) within a certain amount of time:

Response Wait Time Duration = (total number of agents + 1) × monitoring frequency + 15 min

For example, suppose we use three monitoring agents and a monitoring frequency of 5 minutes. Each agent will wait for a response for Response Wait Time Duration = (3 + 1) × 5 min + 15 min = 35 min. If this time expires and no response has been received, the agent reports Undefined.
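The arithmetic of the Response Wait Time Duration formula can be checked with a short sketch. The function name `response_wait_minutes` and its parameter names are chosen here purely for illustration.

```python
def response_wait_minutes(agent_count: int, frequency_minutes: int) -> int:
    """Response Wait Time Duration = (number of agents + 1) x frequency + 15 min."""
    return (agent_count + 1) * frequency_minutes + 15


# Three agents and a 5-minute monitoring frequency, as in the example above.
print(response_wait_minutes(3, 5))  # 35
```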

3. Duration of “Postponed” state. Postponing a device at any moment stops all monitoring activity until the device is re-enabled.

4. Duration Excluded by Schedule. Schedules can also significantly affect Uptime/Downtime calculations. They are an option for managing your monitoring during routine maintenance: monitoring can be postponed for specific days of the week as well as for specific hours and minutes of a day. To set up a schedule, follow the instructions.

Any change to a device’s settings (including a device restart) during the Down state resets the state, so no uptime alert is sent.

EXAMPLE:

[Figure: example_regular]

Let’s say we have a device monitored from 7 locations and a filter requiring that 3 locations report an error for a Downtime condition. First, one monitoring node (agent 1) detects an error while the rest are still reporting success, then a second (agent 2), and finally a third (agent 4) detects an error at T4, which triggers the filter and starts Downtime from that moment. The Down state remains until a hypothetical Postpone is set at T5, because throughout this time the number of agents reporting errors stays at or above the configured 3. The gap between T6 and T7 illustrates that the first response arrives with a delay (monitoring session processing time includes network transfer delays and the execution itself), so the “Postponed” time is calculated as ∆(T7–T5) (Postponed 2nd). Again, Downtime begins only on the 3rd error, from Agent 3, and the Up state returns only with the T9 response, when the number of failing agents drops below the number set in the filter. The final downtime % calculation formula for this case is: