An article by Dotcom-Monitor, “Caffeinated DNS Monitoring and the AT&T DNS Outage,” published on SpeedAwarenessMonth.com, examines the AT&T Domain Name System (DNS) outage of Aug. 15, 2012 and demonstrates why a non-cached method of DNS monitoring results in a faster time-to-repair (TTR), and in some cases even zero downtime from a DNS issue.
The full article is available at SpeedAwarenessMonth.com; the basics are as follows:
To Cache or Not-to-Cache – that is the DNS Monitoring Question
It is not widely known that external, HTTP-request-based website monitoring, like coffee at your local java joint, comes in different “grades”: cache-based and non-cache-based. Dotcom-Monitor employs non-cached monitoring, which goes through the full DNS resolution process with each monitoring instance. Cache-based monitoring (used by many basic monitoring services) reuses previously resolved answers instead of repeating the DNS resolution process, so it misses DNS issues.
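To make the difference concrete, here is a minimal Python sketch of one building block of a non-cached check: a DNS query sent directly to a specific name server with the "recursion desired" flag cleared, so no resolver cache sits between the monitor and the server being tested. This is an illustration only, not Dotcom-Monitor's implementation; the function names are hypothetical, and a full non-cached check would repeat this step from the root name servers down to the authoritative server.

```python
import socket
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234,
                    recursion_desired: bool = False) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # The RD (recursion desired) bit is 0x0100 in the flags word.
    # Leaving it clear asks the server to answer only from its own data.
    flags = 0x0100 if recursion_desired else 0x0000
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet)
    return header + qname + struct.pack("!HH", 1, 1)


def response_code(response: bytes) -> int:
    """Return the RCODE (low 4 bits of the flags word); 0 means no error."""
    _query_id, flags = struct.unpack("!HH", response[:4])
    return flags & 0x000F


def check_nameserver(server_ip: str, hostname: str,
                     timeout: float = 3.0) -> bool:
    """Query one name server directly; True if it replies without an error."""
    query = build_dns_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            response, _addr = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable servers
            return False
    # The reply must echo our query ID and carry RCODE 0.
    return response[:2] == query[:2] and response_code(response) == 0
```

A call such as `check_nameserver("192.0.2.53", "www.example.com")` (both values are placeholders) returns `False` when the server does not answer in time, which is exactly the kind of failure a cached lookup would hide until the cached entry expired.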
How to Effectively Monitor for the next DNS Outage Situation
In a situation like the AT&T DNS outage, several key factors help to speed up time-to-repair (TTR) or avoid downtime altogether:
- Error detection method: Use a monitoring solution whose non-cache method sends DNS queries all the way through to the root name servers with each monitoring instance. A cache-method service caches DNS answers and therefore will not detect a secondary DNS issue at all, or may take days or even weeks to detect it.
- Frequency of monitoring: Use a higher frequency of non-cache monitoring, such as every minute rather than once per hour. The sooner the non-cache monitoring solution detects the problem and alerts the administrator of a website that depends on a failing DNS service, the sooner a switch can be made to a DNS fail-over provider.
- Value of the Time-to-Live (TTL) setting: The TTL controls how long a domain’s IP address, as served by the primary authoritative name server, may be cached. The smaller the TTL value set by the DNS administrator, the faster a fail-over to another DNS provider can take effect. The TTL is typically set to 86,400 seconds (1 day) or more; in disaster recovery planning it can be set as low as 300 seconds (5 minutes). However, the lower the setting, the higher the load on the authoritative name server.
- Diagnostics: Use a monitoring solution that provides diagnostics, such as an automatic trace-route at the time the DNS problem is detected (keep in mind that many basic monitoring services do not provide any diagnostic information).
- Repair: Continue monitoring during the error condition to further pinpoint the issue. Send the monitoring results to your DNS provider. You can also run free manual DNS trace-routes at www.dotcom-monitor.com/WebTools/trace.asp (select Trace Style “DNS”) to verify the issue as needed.
- Prevent: Keep an eye on “soft error” DNS issues, such as DNS slowdowns and intermittent DNS outages, so you can take action before the “soft error” becomes a “hard error” such as customer-facing downtime.
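The interplay between monitoring frequency and TTL in the list above can be put into rough numbers. The Python sketch below uses a deliberately simplified model that is an assumption of this example, not a formula from the original article: the worst-case delay is one full monitoring interval (the failure starts just after a check), plus the time to switch providers, plus one full TTL (a resolver cached the old answer just before the switch). The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def worst_case_failover_seconds(check_interval_s: int, ttl_s: int,
                                switch_time_s: int = 0) -> int:
    """Upper bound on the delay before a fail-over takes effect.

    Assumed model: detection waits up to one check interval, then the
    switch itself takes switch_time_s, then cached answers live on for
    up to one TTL.
    """
    return check_interval_s + switch_time_s + ttl_s


def refreshes_per_resolver_per_day(ttl_s: int) -> float:
    """How often a single busy resolver must re-query per day at this TTL."""
    return SECONDS_PER_DAY / ttl_s


# 1-minute checks with a 300-second TTL versus hourly checks with a 1-day TTL
fast = worst_case_failover_seconds(check_interval_s=60, ttl_s=300)
slow = worst_case_failover_seconds(check_interval_s=3_600, ttl_s=86_400)
print(fast, slow)  # 360 90000
print(refreshes_per_resolver_per_day(300))  # 288.0
```

Under these assumptions the fast configuration bounds the delay at 6 minutes, against 25 hours for the slow one, while each resolver refreshes its answer 288 times a day instead of once. That is the load trade-off on the authoritative name server noted in the TTL item.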
Thanks, I’ll take the Caffeinated Double Depth Charge, Non-cached
It’s clear, then, that a combination of non-cached monitoring and the other factors above limits the downtime exposure due to issues like the AT&T DNS outage of Aug. 15, 2012. Furthermore, a non-cached method of DNS monitoring is a critical factor in achieving a faster TTR, and even zero downtime.
Finally, it is important to remember that TTR determines the loss due to downtime. In other words, the longer the total time it takes to detect, diagnose, and repair a DNS problem, the worse the impact of the DNS issue. Conversely, the more a monitoring solution shortens TTR, the more the loss is reduced, or even avoided completely.
Like a good, strong cup of caffeinated coffee, a non-cache method can make the difference between a downtime day and a fast, productive day.
For more on the AT&T DNS outage see our article, Doing DNS Monitoring Right: The AT&T DNS Outage.