Amazon Cloud Outage

Thoughts on the Recent Amazon Outage and Cloud Network Reliability

Gayn B. Winters, Ph.D.

May 10, 2011

The Amazon team should be complimented for their excellent root-cause analysis report, “Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region.”  It should be required reading for everyone working in Cloud computing.

Having studied disasters of all types, I was reminded by this report of the “Rule of Three”: almost all disasters occur as a combination of three “bad” events.  Chernobyl, Three Mile Island, Fukushima, ship disasters, aircraft crashes, etc., all seem to have this property.  It is intuitive that most systems can deal with one failure, and many can deal with two – perhaps with some damage.  But three failures can be catastrophic.  Worse, they lead to other problems and expensive recoveries.

In the case of the Amazon outage, the initial fault was a network reconfiguration that mapped a primary storage subnet onto one with insufficient capacity (a human error compounded by the lack of automation to check the target subnet’s capacity); other problems surfaced quickly.  The list of additional causes went well beyond the two extra failures the Rule of Three warns us about: insufficient overall capacity, a race-condition bug, poor back-off algorithms on retries of failed replication attempts, a control plane that ran out of thread capacity and could not service negotiation requests (both a design and a capacity issue), and a lack of fine-grained alarming in the control plane.
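The back-off problem above is worth dwelling on: when thousands of nodes retry failed replication attempts in lockstep, the retries themselves become a load storm that keeps the degraded system down.  The standard countermeasure is capped exponential back-off with random jitter.  The sketch below is illustrative only – the function name and parameters are my own, not anything from the Amazon report:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=30.0):
    """Retry a failing operation with capped exponential back-off and jitter.

    Spreading retries out in time, rather than retrying immediately and in
    synchronized waves, avoids the "retry storm" that can keep an
    already-degraded cluster from recovering.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Delay doubles each attempt, capped at max_delay; full jitter
            # de-synchronizes the many clients retrying at once.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Note that the jitter is as important as the exponential growth: without it, all the failed replicas wake up and retry at the same instant, recreating the original overload.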

After the ship’s crew recovers from the storm that knocks over the mast, kills the engine drive, and puts a hole in the hull, it still needs to get the ship back to port.  In this case, once the Amazon team stabilized its Cloud Network, it still needed to recover all the customer data.  The team needed to utilize multiple techniques to do this.  This effort surfaced additional bugs and problems, which the excellent report also details along with the team’s solutions.  All in all, very little customer data were lost, and even though zero data loss should be the goal, the Amazon team should again be complimented.

The report contains a list of corrective and preventive measures that apply to all Cloud Networks: a careful analysis of “adequate capacity,” more intelligent automation of maintenance tasks, smarter retry and back-off algorithms for replication and other failures, fixes for the bugs found, better utilization of independent Clouds or Zones, and, most importantly, better education of and communication with the Cloud’s customers.

Additional required reading should be Richard Feynman’s appendix to the Rogers Commission Report on the Space Shuttle Challenger accident.  In this appendix, Feynman pointed out that NASA management thought the ratio of success to failure was 100,000 to 1, while the working engineers thought it was around 100 to 1.  That is a factor of 1000 difference!  To date, there have been 130 flights and 2 failures (Challenger and Columbia), a 65-to-1 success-to-failure ratio.  Even the engineers were optimistic!

Feynman’s 1986 analysis of the Space Shuttle program makes many observations of inadequate thinking surrounding the quality and safety of the program.  Unfortunately, many of the problems that Feynman pointed out were never fully addressed, and on February 1, 2003, Space Shuttle Columbia broke apart only minutes from an expected landing after an otherwise apparently successful mission.  While NASA’s shuttle program and a Cloud Network implementation are hardly comparable, and Amazon’s report is no match for Feynman’s, there is a lesson to be learned.  A Cloud Network implementation deserves a similar level of analytical scrutiny, and we need to be vigilant to avoid an “optimistic management” trap of thinking that Cloud Networks are impervious to failures or invulnerable to security problems.

We have certainly not seen our last airline crash nor even our last nuclear disaster.  With only two Space Shuttle flights remaining, we can pray for their success.  That said, Cloud Network computing is both new and complex, and it therefore seems likely that the Amazon outage will not be the last big outage of Cloud Networks.  The Rule of Three will surely bite us again.

Feynman also writes about how NASA kept bugs out of its software.  This is worth reading as well. The Amazon report’s analysis of the outage and the recommendations for corrections and improvements also provide at least several hints as to how we should test Cloud Network implementations.

  1. Since the Amazon analysis concluded that capacity was a factor in the outage, a Cloud Network should definitely be stress tested with its nominal capacity significantly exceeded.
  2. Since the initial fault for the outage was a failed administrative task, a Cloud Network should be tested using the technique of “fault insertion.”  In other words, intentionally make something bad happen, and see what the consequences are.  It should be noted that the Amazon initial fault and its unfortunate consequences uncovered multiple code bugs.
  3. Combining items 1 and 2, fault insertion should be part of the stress testing.  In other words, analyze what the system does when a fault occurs and the system is being stressed.  This includes forcing all redundant components to fail under stress and in various combinations.
  4. Any outage within a Cloud Network has the potential to lose or corrupt customer data.  Thus the mechanisms to recover data from backups, replicas, and transaction logs need to be thoroughly tested.
  5. As part of fault insertion techniques, processes should be artificially killed and storage devices should be artificially corrupted.  The consequences of such faults should then be analyzed, and the data recovery techniques should be further tested.
  6. A control or monitoring function such as Amazon’s control plane also has the potential to cause or exacerbate problems.  Such functionality should receive significant testing that includes stress testing and fault insertion.
  7. The Amazon outage was fortunately not caused by or even complicated by security problems.  Other systems may not be so fortunate.  A wide variety of both internal and external security attacks should be simulated and a thorough analysis of the results should be done.  Such simulated attacks should include attacks on the control and monitoring functions.
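To make items 2 through 5 concrete, here is a minimal sketch of fault insertion against a toy replicated store.  Everything here is hypothetical – the class, the quorum-free read rule, and the fault hooks are invented for illustration; a real Cloud Network test harness would inject faults into live processes and devices:

```python
class ReplicatedStore:
    """Toy key-value store with N replicas, for demonstrating fault insertion.

    A write goes to every live replica; a read is served by any live replica
    that holds the key.  The kill/corrupt methods are the fault-insertion
    hooks a test harness would drive.
    """
    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]
        self.alive = [True] * replicas

    def put(self, key, value):
        for i, rep in enumerate(self.replicas):
            if self.alive[i]:
                rep[key] = value

    def get(self, key):
        # Read from any surviving replica that still has the key.
        for i, rep in enumerate(self.replicas):
            if self.alive[i] and key in rep:
                return rep[key]
        raise KeyError(key)

    # --- fault-insertion hooks, used only by tests ---
    def kill_replica(self, i):
        """Simulate a crashed process (item 5)."""
        self.alive[i] = False

    def corrupt_replica(self, i):
        """Simulate a corrupted storage device (item 5)."""
        self.replicas[i].clear()

def fault_insertion_test():
    store = ReplicatedStore(replicas=3)
    store.put("k", "v")
    store.kill_replica(0)       # fault 1: process death
    store.corrupt_replica(1)    # fault 2: storage corruption
    # With two of three replicas faulty, the data should still be readable.
    return store.get("k")
```

The interesting tests are the combinations: kill and corrupt replicas while writes are in flight and the system is under load, and verify both that reads still succeed and that the recovery machinery restores full redundancy afterward.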

As a final comment, mobile devices are predicted to be a significant if not a dominant source of end user access to Cloud Networks in the near future.  Such access has the potential to produce new types of load or stress, new types of corruption, and new types of security problems for Cloud Networks.  This dimension of Cloud Network computing goes way beyond the Amazon report and needs to be thoroughly analyzed.


