It’s been a while since this blog has discussed cloud outages (for example: quick link). The recent reports of Microsoft and Amazon outages give one pause to contemplate. First, cloud outages are a fact of life, and they should in no way deter anyone from embracing cloud technology – public, private, or hybrid. On the other hand, anyone adopting any technology should think deeply about how to deal with the inevitable failures that the technology will bring to your operation.
First, what has happened? Microsoft Azure has recently had two outages, and Amazon AWS has had one.
Let’s start with an Azure outage on Feb 29, 2012, where a leap year bug took out VM services for 13 hours and 23 minutes. Apparently, when a VM was initiated on Feb 29, 2012 with a one-year certificate, the certificate was stamped valid until Feb 29, 2013 – an illegal date. The initialization failed, and the failure was erroneously interpreted as a failure of the host server itself, so an attempt was made to initialize the VM on another physical server. That attempt failed for the same reason, and you can see that further attempts created a real mess. Note that this mess was caused by TWO bugs, not just the leap year bug.
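Microsoft never published the offending code, but the failure mode is easy to reproduce: naively bumping the year field of a leap-day date asks for a date that doesn’t exist. Here is a minimal Python sketch (the function names are mine, purely illustrative – this is not Azure’s actual certificate code):

```python
from datetime import date

def naive_expiry(issued: date) -> date:
    # Naive "add one year": just bump the year field.
    # For a Feb 29 issue date this produces Feb 29 of a
    # non-leap year, which does not exist.
    return issued.replace(year=issued.year + 1)

def safe_expiry(issued: date) -> date:
    # Fall back to Feb 28 when the anniversary doesn't exist.
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)

leap_day = date(2012, 2, 29)
try:
    naive_expiry(leap_day)
except ValueError as e:
    print("naive arithmetic fails:", e)

print("safe expiry:", safe_expiry(leap_day))  # 2013-02-28
```

The fix is one line of defensive date handling – which is why the bug reads as sophomoric in hindsight, and why date/time testing across software components was added to Microsoft’s remediation list.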
OK, so it took 13 hours and 23 minutes to patch this bug on all but seven Azure clusters. Those clusters were in the middle of a different upgrade. What to do? Microsoft’s effort bombed. They attempted to roll back and patch, but they failed to revert to an earlier version of a network plug-in that configures a VM’s network. The new plug-in was incompatible with the older, patched host and guest agents, and all VMs in these seven clusters were immediately disconnected from the network! Cleaning up this new mess took until 2:15 AM the next day. The total loss of full functionality lasted over 26 hours.
What to fix? Clearly the sophomoric leap year bug was fixed, along with some testing for date/time incompatibilities among software components. Fixed also was the problem of declaring an entire server bad when just a VM had problems. Finally, Microsoft intelligently added graceful degradation to VM management: blocking the creation of new VMs or the extension of old ones, instead of rashly shutting down the entire platform over a small problem.
Because their customer service lines were swamped this sad leap year day, Microsoft also upgraded its error detection software to detect problems faster, and it upgraded its customer dashboard to improve its availability in the presence of system problems. Outage notification via Twitter and Facebook has now been at least partially implemented.
Next, on June 14, Amazon’s AWS center in Virginia experienced severe storms and, consequently, the failure of back-up generators. (See Amazon’s frank and excellent root-cause analysis, which includes the bugs found and plans to improve backup power and to fix the bugs.) Related power problems took down portions of the data center. Multiple services and some hosted web sites were down for several hours. It was reported that this was a “once in a lifetime storm,” but it took down Netflix, Instagram, and Pinterest. Once power was restored, Amazon went to work restoring “instances” (running jobs) and storage volumes. Amazon also reported unusually high error rates for a while, the cause of which will probably never be known. Amazon calculated their outage to be 5 hours and 20 minutes.
Next, on July 26, 2012 at 11:09 AM, Microsoft announced “an availability issue” for Windows Azure in the West Europe sub-region. At 1:33 PM, they announced that the issue was resolved. That is an outage of 2 hours and 24 minutes, although Microsoft totaled the outage at 3.5 hours. Apparently storage and running applications were not affected. As of this writing, I can find no root-cause analysis published by Microsoft.
OK, what can we learn from these (and other) outages? First, cloud technology is new, and even the most experienced pioneer, Amazon, has problems. Second, putting all of one’s computational eggs in one cloud vendor’s data center basket is not going to give you a five-nines system, and you may well lose important data.
The obvious, and expensive, solution is to duplicate your cloud-based systems across multiple geographies and to have a fail-over strategy from your primary system location to your secondary location. I’ve seen people recommend the use of two different cloud vendors, but I find it hard to believe that the pain and cost of two different vendors are worth it. The disaster data seem to indicate a low probability of systemic and instantaneous errors occurring across a vendor’s entire set of data centers. (Although Amazon’s EC2 and EBS failures in April 2011 did affect two “Availability Zones.”) In addition, while cloud vendors have at best skimpy data center fail-over services, you might as well use what they have. It is interesting that Adrian Cockcroft of Netflix [9, 10] argues for using three availability zones (presumably in different geographic locations) with no extra (live) instances.
What about private clouds? Well, they are great, and they provide improved availability. On the other hand, they are just as susceptible to disasters as regular private data centers. The good news is that fail-over to a public cloud, while it may give reduced performance, can be a cost-effective business continuity strategy. This fail-over to a public cloud may also take a long time to “spin up,” because public cloud vendors take a long time (20 to 40 minutes) to instantiate a job as big as a private cloud, even if the data are all on its file system and the capacity is “reserved.”
An alternative, of course, is to duplicate your private cloud at a second geographic location; the spin-up time would be much less. Private clouds are usually local in order to provide high bandwidth and low latency to their clients. This performance advantage would be lost at a remote location, but the business continuity may be worth the degradation in network performance until the local private cloud is up again.
All data center fail-over mechanisms require a reasonably continuous backup stream to the remote data site and the ability to launch the private cloud system at the remote site. Those transactions that didn’t get recorded at the remote site will probably be lost, and a restart (from a checkpoint) mechanism is essential.
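To make the checkpoint requirement concrete, here is one common generic pattern (a sketch of my own, not any vendor’s mechanism): write each checkpoint to a temporary file and atomically rename it into place, so that a crash mid-write can never corrupt the last good checkpoint that the remote site would restart from.

```python
import json
import os
import tempfile

def write_checkpoint(path: str, state: dict) -> None:
    # Write to a temp file in the same directory, fsync it, then
    # atomically rename over the old checkpoint. A crash mid-write
    # leaves the previous checkpoint intact.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

def restore_checkpoint(path: str) -> dict:
    # On restart at the remote site, resume from the last
    # durable checkpoint; later unrecorded transactions are lost.
    with open(path) as f:
        return json.load(f)
```

Anything that happened after the last checkpoint shipped to the remote site is exactly the “transactions that didn’t get recorded” the paragraph above warns about.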
Let’s analyze availability ratings for Azure and AWS. Recall that the availability of a system is the percent of time the system is fully functional. Microsoft’s Azure was not fully functional, due to the two outages discussed here, for 29.5 hours. If this were the only time the system was not fully functional for the year (most likely there were other, shorter, unreported partial outages), then Azure would have no better than a 99.66% availability rating. For Amazon, the loss of full functionality in 2011 (see quick link) lasted several days, although 60 percent of the instances were restored within 12 hours. Let’s estimate 48 hours of lost full functionality in 2011 and just 5.3 hours (so far) in 2012. This averages to 99.7% availability (less if I hadn’t assumed zero downtime for the rest of 2012).
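The arithmetic behind these figures is simple enough to check yourself. A quick sketch, using the downtime estimates above:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def availability(downtime_hours: float, period_hours: float) -> float:
    """Percent of the period the system was fully functional."""
    return 100.0 * (1.0 - downtime_hours / period_hours)

# Azure: 29.5 hours of lost full functionality in one year.
print(f"Azure: {availability(29.5, HOURS_PER_YEAR):.2f}%")            # 99.66%

# AWS: ~48 hours in 2011 plus 5.3 hours in 2012, over two years.
print(f"AWS:   {availability(48 + 5.3, 2 * HOURS_PER_YEAR):.2f}%")    # 99.70%

# For scale: five nines allows only about 5.3 minutes of downtime a year.
print(f"5-nines budget: {HOURS_PER_YEAR * (1 - 0.99999) * 60:.1f} min/yr")
```

For comparison, five nines permits roughly five minutes of downtime per year – three orders of magnitude less than a single 29.5-hour incident.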
Now some could argue that these outages didn’t affect all of their data centers, and that the availability ratings – based on, say, 10 data centers – should be an order of magnitude better. Well, OK, that would rate them both at three nines rather than two nines and change. But the point is that this is VERY FAR from five nines. The conclusion, again, is that depending on only a single data center yields very low availability, even before one factors in bugs and other problems in the customer’s own system.
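This is also why duplicating across locations pays off so quickly: if site failures were truly independent, availability compounds with each replica. A back-of-the-envelope sketch (independence is a strong assumption – a region-wide storm or a systemic software bug produces correlated failures and breaks it):

```python
def replicated_availability(a: float, n: int) -> float:
    # Probability that at least one of n independent replicas is up,
    # where a is the availability of a single replica (0..1).
    return 1.0 - (1.0 - a) ** n

single = 0.997  # roughly the single-vendor figure estimated above
for n in (1, 2, 3):
    print(f"{n} site(s): {100 * replicated_availability(single, n):.6f}%")
```

Under the independence assumption, just two sites at 99.7% each already exceed five nines – which is the quantitative case for a primary/secondary fail-over strategy, and part of why Cockcroft’s three-zone recommendation is so aggressive.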
Finally, I note that many of the large cloud vendors’ problems stem from the fact that they are large and hence complex. A private cloud may well have less complexity and a simpler redundancy story, and hence a higher availability rating. Sadly, I haven’t seen such an example.
 http://www.forbes.com/fdc/welcome_mjx.shtml (This site has (or had) an apparent AWS ad with the cute caption “Cloudy with a chance of fail.” Clicking on it led to