Amazon’s service level agreement for its cloud computing service, EC2, is 99.95% uptime for a service region (that’s a data center) during a one-year period.
That means the service will be down no more than five one-hundredths of a percent of the time, or less than four and a half hours per year. Being a tester, I set out to evaluate this claim, starting with a Google search for “Amazon EC2 Down 2012.” I found a half-dozen issues in the twenty-minute to half-hour range, plus a seven-hour outage in April.
Instead of tearing into Amazon, I’d like to be fair. They do a lot right. They have multiple redundant switches, internet connections, backup power supplies, and servers. They have a great deal of technical depth and 24/7 monitoring, and by acting swiftly they keep most outages to twenty minutes. When the large outages do happen, they are things no one could have predicted, and the folks at Amazon move to prevent them from ever happening again.
The problem is that something else will happen next time.
So if Amazon can’t prevent these sorts of “unpredictable behaviors,” what makes us think that we can in traditional organizations? For that matter, if Amazon can’t meet its modest 99.95% claim, how can the folks at Microsoft promise to go beyond ‘five nines,’ when five nines, or 99.999%, allows only about five minutes of downtime per year?
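To make that arithmetic concrete, here is a minimal Python sketch of my own; it is an illustration, not anything Amazon or Microsoft publishes, and it ignores the measurement windows and exclusions real SLAs define.

```python
# Convert an SLA uptime percentage into the downtime it allows per year.
# Illustrative only; real SLAs define measurement windows and exclusions
# that this simple arithmetic ignores.

HOURS_PER_YEAR = 365.25 * 24  # roughly 8,766 hours


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Return the hours of downtime per year an uptime percentage permits."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for sla in (99.95, 99.999):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% uptime allows about {hours:.2f} hours "
          f"({hours * 60:.1f} minutes) of downtime per year")

# 99.95%  -> roughly 4.4 hours per year
# 99.999% -> roughly 5.3 minutes per year
```

Notice that the single seven-hour April outage, by itself, would blow through the entire 99.95% budget for the year.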
The first part of the problem is the Turkey Problem; the second is a naive definition of the system.
The Turkey Problem
Consider the humble turkey, living on the farm. Each day, the farmer visits the turkey. He protects the turkey; he feeds the turkey, provides it shelter, and keeps it safe from natural enemies. Each day, the turkey has more reason to believe that tomorrow will be a nice, safe day.
Right up until Thanksgiving, when the farmer brings an axe.
Service level agreements have to be based on something. Like the turkey’s confidence, they are often based on past experience. Past experience has a problem: it can’t predict future events.
So even when Amazon has two redundant power systems, both can go down if a car slams into them. As systems become more complex, the failures become more nuanced. In the seven-hour car-crash outage, the data center cut over to internal power, but the automated system made a mistake: it concluded the power fault was inside the data center and shut down the generators for safety reasons.
The turkey example is not my idea; it comes from a book called “The Black Swan: The Impact of the Highly Improbable” by Nassim Nicholas Taleb.
Taleb calls these ‘black swans,’ after the famous claim in logic: even if every swan ever found has been white, a million white swans do not disprove that a black swan could exist. (It turns out that they do; English scientist John Latham described black swans from Australia in 1790.)
So if a black swan can force Amazon to miss its modest service level agreements, what can we do about it in our organizations?
I’ll talk about that more in my next post, but first we’ll have a guest post by Pete Walen on Yugo Testing.