
Virtual Block Storage Crashed Your Cloud Again :(


You know it’s bad when you start writing an incident report with the words “The first 12 hours.” You know you need a stiff drink, possibly a career change, when you follow that up with phrases like “this was going to be a lengthy outage…”, “the next 48 hours…”, and “as much as 3 days”.

That’s what happened to huge companies like Netflix, Heroku, Reddit, Hootsuite, Foursquare, Quora, and Imgur the week of April 21, 2011. Amazon AWS went down for over 80 hours, leaving them and others up a creek without a paddle. The root cause of this cloud-tastrify echoed loud and clear.

Heroku said:

The biggest problem was our use of EBS drives, AWS’s persistent block storage solution… Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can.

Reddit said:

Amazon had a failure of their EBS system, which is a data storage product they offer, at around 1:15am PDT. This may sound familiar, because it was the same type of failure that took us down a month ago. This time however the failure was more widespread and affected a much larger portion of our servers

While most companies made heartfelt resolutions to get off of EBS, Netflix was quick to point out that they never trusted EBS to begin with:

When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service…

Fool me once…

As Reddit mentioned in their postmortem, AWS had similar EBS problems twice before on a smaller scale in March. After an additional 80+ hours of downtime, you would expect companies to learn their lesson, but the fact is that these same outages continue to plague clouds using various types of virtual block storage.

In July 2012, AWS experienced a power failure which resulted in a huge number of possibly inconsistent EBS volumes and an overloaded control plane. Some customers experienced almost 24 hours of downtime.

Heroku, under the gun again, said:

Approximately 30% of our EC2 instances, which were responsible for running applications, databases and supporting infrastructure (including some components specific to the Bamboo stack), went offline…
A large number of EBS volumes, which stored data for Heroku Postgres services, went offline and their data was potentially corrupted…
20% of production databases experienced up to 7 hours of downtime. A further 8% experienced an additional 10 hours of downtime (up to 17 hours total). Some Beta and shared databases were offline for a further 6 hours (up to 23 hours total).

AppHarbor had similar problems:

EC2 instances and EBS volumes were unavailable and some EBS volumes became corrupted…
Unfortunately, many instances were restored without associated EBS volumes required for correct operation. When volumes did become available they would often take a long time to properly attach to EC2 instances or refuse to attach altogether. Other EBS volumes became available in a corrupted state and had to be checked for errors before they could be used.
…a software bug prevented this fail-over from happening for a small subset of multi-az RDS instances. Some AppHarbor MySQL databases were located on an RDS instance affected by this problem.

The saga continued for AWS, which had more problems with EBS later in 2012. They detail, ad nauseam, how a small DNS misconfiguration triggered a memory leak that caused a silent cascading failure of all the EBS servers. As usual, the EBS failures impacted API access and RDS services. Yet again, Multi-AZ RDS instances didn’t fail over automatically.

Who’s using Virtual Block Storage?

Amazon EBS is just one very common example of Virtual Block Storage, and by no means the only one to fail miserably.

Azure stores the block devices for all their compute nodes as Blobs in their premium or standard storage services. Back in November, a bad update to the storage service sent some of their storage endpoints into infinite loops, denying access to many of these virtual hard disks. The bad software was deployed globally and caused more than 10 hours of downtime across 12 data centers. According to the post, some customers were still being affected as much as three days later.

HP Cloud provides virtual block storage based on OpenStack Cinder. See related incident reports here, here, here, here, here. I could keep going back in time, but I think you get the point.

Also based on Cinder, Rackspace offers their Cloud Block Storage product. Their solution has some proprietary component they call Lunr, as detailed in this Hacker News thread, so you can hope that Lunr is more reliable than other implementations. Still, Rackspace had major capacity issues spanning more than two weeks back in May of last year, and I shudder to think what would have happened if anything had gone wrong while capacity was low.

Storage issues are so common and take so long to recover from in OpenStack deployments that companies are deploying alternate cloud platforms as a workaround while their OpenStack clouds are down.

What clouds won’t ruin your SLA?

Rackspace doesn’t force you to use their Cloud Block Storage, at least not yet, so unless they are drinking their own Kool-Aid in ways they shouldn’t be, you are hopefully safe there.

Digital Ocean also relies on local block storage by design. They are apparently considering other options but want to avoid an EBS-like solution for the reasons I’ve mentioned. While their local storage isn’t putting you at risk of a cascading failure, they have been reported to leak your data to other customers if you don’t destroy your machines carefully. They also have other fairly frequent issues which take them out of the running for me.

The winning horse

As usual, Joyent shines through on this. For many reasons, the SmartDataCenter platform, behind both their public cloud and open source private cloud solutions, supports only local block storage. For centralized storage, you can use NFS or CIFS if you really need to, but you will not find virtual block storage or even SAN support.

Joyent gets some flak for this opinionated architecture, occasionally even from me, but they don’t corrupt my data or crash my servers because some virtual hard disk has gone away or some software upgrade has been foolishly deployed.

With their recently released Docker and Linux Binary support, Joyent is really leading the pack with on-metal performance and availability. I definitely recommend hitching your wagon to their horse.

The Nooooooooooooooo! button

If it’s too late and you’re only finding this article post cloud-tastrify, I refer you to the ever amusing Nooooooooooooooo! button for some comic relief.

ISPs Complaining about P2P

Recently this issue came up on the linux-il mailing list. Apparently one of the bigger Israeli ISPs started enforcing a bandwidth cap clause in their Terms of Service after they realized that their lines were overloaded. A couple of people pulled out the following statistic:

P2P still represented 60% of Internet traffic at the end of 2004 (http://www.cachelogic.com/research/2005_slide07.php)

I’ve seen this and similar statistics before, but I never understand why people are surprised by them. On its own the statistic is meaningless but, for some reason, everyone thinks it proves that the ISPs are right.

The truth is that even if 95% of Internet users only used the Internet for email, P2P could still theoretically take up 60% of the bandwidth because it is inherently a very high bandwidth application.

Let’s assume for the sake of argument that there are 100 users on the Internet.

  • 25 of the users are using only P2P and 75 of the users are using only Email.
  • All users have a 1M connection.
  • The P2P users download 24/7, giving each of them 24M of bandwidth usage per day (counting a saturated 1M line as 1M per hour).

Even if each email user downloads 5.33M of email each day, the P2P users still account for 60% of the bandwidth used:

  • P2P: 25 users × 24M = 600M
  • Email: 75 users × 5.33M ≈ 400M
  • Total: 1000M
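The arithmetic is easy to check in a few lines of Python. This is only a sketch of the toy model above, not a measurement of anything: the function names are made up for illustration, and usage is counted in the same loose "M per day" units as the example. The last line works out the 95%/5% case mentioned earlier, under the same assumption that each P2P user pulls 24M a day.

```python
# Toy model from the text: usage is counted in "M per day".
P2P_DAILY_M = 24.0  # a saturated 1M line running 24 hours a day (the text's units)


def p2p_share(p2p_users, email_users, email_daily_m, p2p_daily_m=P2P_DAILY_M):
    """Fraction of total daily usage generated by the P2P users."""
    p2p_total = p2p_users * p2p_daily_m
    email_total = email_users * email_daily_m
    return p2p_total / (p2p_total + email_total)


def email_usage_for_share(p2p_users, email_users, share, p2p_daily_m=P2P_DAILY_M):
    """Per-user email usage at which P2P is exactly `share` of the total."""
    p2p_total = p2p_users * p2p_daily_m
    total = p2p_total / share
    return (total - p2p_total) / email_users


# 25 P2P users and 75 email users who each download 400/75 (about 5.33M) a day.
print(round(p2p_share(25, 75, 400 / 75), 3))          # 0.6

# How much each email user may download before P2P drops below 60%.
print(round(email_usage_for_share(25, 75, 0.60), 2))  # 5.33

# Same question when only 5 of the 100 users are on P2P.
print(round(email_usage_for_share(5, 95, 0.60), 2))   # 0.84
```

So with only 5% of users on P2P, the other 95% would each have to stay under roughly 0.84M of email a day for P2P to hold a 60% share, which is not hard to imagine for people who only read mail.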

In reality, P2P users will probably have higher speed connections than Email users, which will give them an even more disproportionate share of the bandwidth.

Once we’ve decided that P2P will always have a huge share of the bandwidth regardless of what percentage of people are actually using it, the real questions become:

  1. Maybe P2P accounts for 60% of what’s being used, but how much isn’t being used?
  2. Isn’t this fact, that some people are using more bandwidth than others, the same reason that ISPs can overbook their lines and make a profit?
  3. If an ISP does a crummy job of planning their “overbooking” should the customers pay the price?
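The first two questions can be made concrete with the same toy numbers (100 users on 1M lines, 1000M used per day, 600M of it P2P). The sketch below is my own illustration, not anything the ISP published: the variable names are hypothetical, and the provisioned capacity in particular is an assumed figure chosen only to show what an overbooking ratio looks like.

```python
# Same toy numbers as the worked example; usage is counted in "M per day".
USERS = 100
LINE_DAILY_M = 24.0    # the most one 1M line can move in a day (text's units)
USED_DAILY_M = 1000.0  # total daily usage in the worked example
P2P_DAILY_M = 600.0    # the P2P users' part of that total

# Hypothetical: the ISP builds only as much capacity as is used today.
PROVISIONED_DAILY_M = 1000.0


def fraction(part, whole):
    """Return part / whole, refusing a non-positive whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


sold_daily_m = USERS * LINE_DAILY_M  # what the customers were sold in total

used_of_sold = fraction(USED_DAILY_M, sold_daily_m)
p2p_of_sold = fraction(P2P_DAILY_M, sold_daily_m)
overbooking_ratio = fraction(sold_daily_m, PROVISIONED_DAILY_M)

print(round(used_of_sold, 3))       # 0.417
print(round(p2p_of_sold, 3))        # 0.25
print(round(overbooking_ratio, 2))  # 2.4
```

In this model P2P is 60% of what is used but only 25% of what was sold, and more than half of the sold bandwidth sits idle. That idle share is exactly what lets an ISP sell 2.4 times the capacity it builds.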

Imagine if an ISP secretly gave every new customer a 5M line for a month and a half, and then all of a sudden the speed dropped to 1.5M.

Joe Schmoe doesn’t know what’s hit him, and when he calls customer service, the rep tells him, “Oh, I’m sorry. We accidentally gave you a 5M line and only just corrected the mistake, but don’t worry, we won’t charge you for it. BTW, are you interested in our new special on 5M lines?”

It’s the same here: for months or years they didn’t say anything. Now that people are used to it, they come and ask for money. It is their own fault that they overbooked the lines. They should deal with it, and if they want to limit new customers, gei gezunt.

I have to say that I’m pretty sure I fall into the category of the Email users, and I don’t use the ISP in question, but if I were in the place of their newly “capped” customers, I would switch to another ISP the same day.