You know it’s bad when you start writing an incident report with the words “The first 12 hours.” You know you need a stiff drink, possibly a career change, when you follow that up with phrases like “this was going to be a lengthy outage…”, “the next 48 hours…”, and “as much as 3 days”.
That’s what happened to huge companies like NetFlix, Heroku, Reddit,Hootsuite, Foursquare, Quora, and Imgur the week of April 21, 2011. Amazon AWS went down for over 80 hours, leaving them and others up a creek without a paddle. The root cause of this cloud-tastrify echoed loud and clear.
The biggest problem was our use of EBS drives, AWS’s persistent block storage solution… Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can.
Amazon had a failure of their EBS system, which is a data storage product they offer, at around 1:15am PDT. This may sound familiar, because it was the same type of failure that took us down a month ago. This time however the failure was more widespread and affected a much larger portion of our servers
While most companies made heartfelt resolutions to get off of EBS, NetFlix was clear to point out that they never trusted EBS to begin with:
When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service…
Fool me once…
As Reddit mentioned in their postmortem, AWS had similar EBS problems twice before on a smaller scale in March. After an additional 80+ hours of downtime, you would expect companies to learn their lesson, but the facts are that these same outages continue to plague clouds using various types of virtual block storage.
In July 2012, AWS experienced a power failure which resulted in a huge number of possibly inconsistent EBS volumes and an overloaded control plane. Some customers experienced almost 24 hours of downtime.
Heroku, under the gun again, said:
Approximately 30% of our EC2 instances, which were responsible for running applications, databases and supporting infrastructure (including some components specific to the Bamboo stack), went offline…
A large number of EBS volumes, which stored data for Heroku Postgres services, went offline and their data was potentially corrupted…
20% of production databases experienced up to 7 hours of downtime. A further 8% experienced an additional 10 hours of downtime (up to 17 hours total). Some Beta and shared databases were offline for a further 6 hours (up to 23 hours total).
AppHarbor had similar problems:
EC2 instances and EBS volumes were unavailable and some EBS volumes became corrupted…
Unfortunately, many instances were restored without associated EBS volumes required for correct operation. When volumes did become available they would often take a long time to properly attach to EC2 instances or refuse to attach altogether. Other EBS volumes became available in a corrupted state and had to be checked for errors before they could be used.
…a software bug prevented this fail-over from happening for a small subset of multi-az RDS instances. Some AppHarbor MySQL databases were located on an RDS instance affected by this problem.
The saga continues for AWS who continued to have problems with EBS later in 2012. They detail ad nauseam, how a small DNS misconfiguration triggered a memory leak which caused a silent cascading failure of all the EBS servers. As usual, the EBS failures impacted API access and RDS services. Yet again Multi-AZ RDS instances didn’t failover automatically.
Who’s using Virtual Block Storage?
Amazon EBS is just one very common example of Virtual Block Storage and by no means, the only one to fail miserably.
Azure stores the block devices for all their compute nodes as Blobs in their premium or standard storage services. Back in November, a bad update to the storage service sent some of their storage endpoints into infinite loops, denying access to many of these virtual hard disks. The bad software was deployed globally and caused more than 10 hours of downtime across 12 data centers. According to the post, some customers were still being affected as much as three days later.
Also based on Cinder, Rackspace offers their Cloud Block Storage product. Their solution has some proprietary component they call Lunr, as detailed in this Hacker News thread so you can hope that Lunr is more reliable than other implementations. Still, Rackspace had major capacity issues spanning over two weeks back in May of last year and I shudder to think what would have happened if anything went wrong while capacity was low.
Storage issues are so common and take so long to recover from in OpenStack deployments, that companies are deploying alternate cloud platforms as a workaround while their OpenStack clouds are down.
What clouds won’t ruin your SLA?
Rackspace doesn’t force you to use their Cloud Block Storage, at least not yet, so unless they are drinking their own kool-aid in ways they shouldn’t be, you are hopefully safe there.
Digital Ocean also relies on local block storage by design. They are apparently considering other options but want to avoid an EBS-like solution for the reasons I’ve mentioned. While their local storage isn’t putting you at risk of a cascading failure, they have been reported to leak your data to other customers if you don’t destroy your machines carefully. They also have other fairly frequent issueswhich take them out of the running for me.
The winning horse
As usual, Joyent shines through on this. For many reasons, the SmartDataCenter platform, behind both their public cloud and open source private cloud solutions, supports only local block storage. For centralized storage, you can use NFS or CIFS if you really need to but you will not find virtual block storage or even SAN support.
Joyent gets some flack for this opinionated architecture, occasionally even from me, but they don’t corrupt my data or crash my servers because some virtual hard disk has gone away or some software upgrade has been foolishly deployed.
With their recently released Docker and Linux Binary support, Joyent is really leading the pack with on-metal performance and availability. I definitely recommend hitching your wagon to their horse.
The Nooooooooooooooo! button
If it’s too late and you’re only finding this article post cloud-tastrify, I refer you to the ever amusing Nooooooooooooooo! button for some comic relief.