Virtual Block Storage Crashed Your Cloud Again :(

You know it’s bad when you start writing an incident report with the words “The first 12 hours.” You know you need a stiff drink, possibly a career change, when you follow that up with phrases like “this was going to be a lengthy outage…”, “the next 48 hours…”, and “as much as 3 days”.

That’s what happened to huge companies like Netflix, Heroku, Reddit, Hootsuite, Foursquare, Quora, and Imgur the week of April 21, 2011. Amazon AWS went down for over 80 hours, leaving them and others up a creek without a paddle. The root cause of this cloud-tastrify echoed loud and clear.

Heroku said:

The biggest problem was our use of EBS drives, AWS’s persistent block storage solution… Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can.

Reddit said:

Amazon had a failure of their EBS system, which is a data storage product they offer, at around 1:15am PDT. This may sound familiar, because it was the same type of failure that took us down a month ago. This time however the failure was more widespread and affected a much larger portion of our servers

While most companies made heartfelt resolutions to get off of EBS, Netflix was quick to point out that they never trusted EBS to begin with:

When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service…

Fool me once…

As Reddit mentioned in their postmortem, AWS had similar EBS problems twice before on a smaller scale in March. After an additional 80+ hours of downtime, you would expect companies to learn their lesson, but the fact is that these same outages continue to plague clouds using various types of virtual block storage.

In July 2012, AWS experienced a power failure which resulted in a huge number of possibly inconsistent EBS volumes and an overloaded control plane. Some customers experienced almost 24 hours of downtime.

Heroku, under the gun again, said:

Approximately 30% of our EC2 instances, which were responsible for running applications, databases and supporting infrastructure (including some components specific to the Bamboo stack), went offline…
A large number of EBS volumes, which stored data for Heroku Postgres services, went offline and their data was potentially corrupted…
20% of production databases experienced up to 7 hours of downtime. A further 8% experienced an additional 10 hours of downtime (up to 17 hours total). Some Beta and shared databases were offline for a further 6 hours (up to 23 hours total).

AppHarbor had similar problems:

EC2 instances and EBS volumes were unavailable and some EBS volumes became corrupted…
Unfortunately, many instances were restored without associated EBS volumes required for correct operation. When volumes did become available they would often take a long time to properly attach to EC2 instances or refuse to attach altogether. Other EBS volumes became available in a corrupted state and had to be checked for errors before they could be used.
…a software bug prevented this fail-over from happening for a small subset of multi-az RDS instances. Some AppHarbor MySQL databases were located on an RDS instance affected by this problem.

The saga continued later in 2012, when AWS had yet more problems with EBS. They detail, ad nauseam, how a small DNS misconfiguration triggered a memory leak, which caused a silent cascading failure of all the EBS servers. As usual, the EBS failures impacted API access and RDS services. Yet again, Multi-AZ RDS instances didn’t fail over automatically.
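To put those outages in SLA terms, here is a rough back-of-the-envelope sketch. The 80-hour and 24-hour figures come from the incidents above; the 99.95% target is only an illustrative assumption, not a quote from any provider's SLA, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_pct(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Best-case availability for the period if this were the only outage."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def downtime_budget_hours(sla_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime a given availability target allows over the period."""
    return period_hours * (1.0 - sla_pct / 100.0)


# Outage durations taken from the incidents described above.
outages = {
    "April 2011 EBS outage": 80,
    "July 2012 power failure": 24,
}

for name, hours in outages.items():
    print(f"{name}: {hours} h down, at best {availability_pct(hours):.2f}% for the year")

# Illustrative target only (assumption).
target = 99.95
print(f"A {target}% target allows {downtime_budget_hours(target):.2f} h of downtime per year")
```

A single 80-hour outage caps the year at roughly 99.09% availability, and even the 24-hour one blows through a 99.95% yearly budget several times over.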

Who’s using Virtual Block Storage?

Amazon EBS is just one very common example of Virtual Block Storage and by no means the only one to fail miserably.

Azure stores the block devices for all their compute nodes as Blobs in their premium or standard storage services. Back in November, a bad update to the storage service sent some of their storage endpoints into infinite loops, denying access to many of these virtual hard disks. The bad software was deployed globally and caused more than 10 hours of downtime across 12 data centers. According to the post, some customers were still being affected as much as three days later.

HP Cloud provides virtual block storage based on OpenStack Cinder, and they have a string of related incident reports to show for it. I could keep going back in time, but I think you get the point.

Rackspace’s Cloud Block Storage product is also based on Cinder. Their solution has a proprietary component they call Lunr, as detailed in a Hacker News thread, so you can hope that Lunr is more reliable than other implementations. Still, Rackspace had major capacity issues spanning over two weeks back in May of last year, and I shudder to think what would have happened if anything had gone wrong while capacity was low.

Storage issues are so common, and take so long to recover from, in OpenStack deployments that companies are deploying alternate cloud platforms as a workaround while their OpenStack clouds are down.

What clouds won’t ruin your SLA?

Rackspace doesn’t force you to use their Cloud Block Storage, at least not yet, so unless they are drinking their own Kool-Aid in ways they shouldn’t be, you are hopefully safe there.

DigitalOcean also relies on local block storage by design. They are apparently considering other options but want to avoid an EBS-like solution for the reasons I’ve mentioned. While their local storage isn’t putting you at risk of a cascading failure, they have been reported to leak your data to other customers if you don’t destroy your machines carefully. They also have other fairly frequent issues, which take them out of the running for me.

The winning horse

As usual, Joyent shines through on this. For many reasons, the SmartDataCenter platform, behind both their public cloud and open source private cloud solutions, supports only local block storage. For centralized storage, you can use NFS or CIFS if you really need to, but you will not find virtual block storage or even SAN support.

Joyent gets some flak for this opinionated architecture, occasionally even from me, but they don’t corrupt my data or crash my servers because some virtual hard disk has gone away or some software upgrade has been foolishly deployed.

With their recently released Docker and Linux Binary support, Joyent is really leading the pack with on-metal performance and availability. I definitely recommend hitching your wagon to their horse.

The Nooooooooooooooo! button

If it’s too late and you’re only finding this article post cloud-tastrify, I refer you to the ever amusing Nooooooooooooooo! button for some comic relief.

SmartDataCenter, the Open Cloud Platform that Actually Already Works

For years enterprises have tried to make OpenStack work and failed miserably. Considering how many heads have broken against OpenStack, maybe they should have called it OpenBrick.

Before I dive into the details, I’ll cut to the chase. You don’t have to break your head on cloud anymore. Joyent has open sourced (as in get it on GitHub) their cloud management platform.

It’s free if you want (install it on your laptop, install it on a server). It’s supported if you want. Best of all, it actually works outside of a lab or CI test suite. It’s what Joyent runs in production for all their public cloud customers (I admit to being one of the satisfied ones). It’s also something they have been licensing out to other cloud providers for years.

Now for the deep dive.

What’s wrong with OpenStack?

First off, it isn’t a cloud in a box, which is what most people think it is. In 2013, Gartner called out OpenStack for consciously misrepresenting what it actually provides:

no one in three years stood up to clarify what OpenStack can and cannot do for an enterprise.

In case you’re wondering, the analyst also quoted eBay’s chief engineer on the true nature of OpenStack:

… an instance of an OpenStack installation does not make a cloud. As an operator you will be dealing with many additional activities not all of which users see. These include infra onboarding, bootstrapping, remediation, config management, patching, packaging, upgrades, high availability, monitoring, metrics, user support, capacity forecasting and management, billing or chargeback, reclamation, security, firewalls, DNS, integration with other internal infrastructure and tools, and on and on and on. These activities are bound to consume a significant amount of time and effort. OpenStack gives some very key ingredients to build a cloud, but it is not cloud in a box.

The analyst made it clear that:

vendors get this difference, trust me.

Other insiders put the situation into similar terms:

OpenStack has some success stories, but dead projects tell no tales. I have seen no less than 100 Million USD spent on bad OpenStack implementations that will return little or have net negative value. Some of that has to be put on the ignorance and arrogance of some of the organizations spending that money, but OpenStack’s core competency, above all else, has been marketing and if not culpable, OpenStack has at least been complicit.

The motive behind the deception is clear. OpenStack is like giving someone a free Ferrari, pink slip and all, but keeping the keys. You get pieces of a cloud but no way to run it. Once you have put all your effort into installing OpenStack and you realize what’s missing, you are welcome to turn to any one of the vendors backing OpenStack for one of their packaged cloud platforms.

OpenStack is a foot in the door. It’s a classic bait and switch, but even after years no one is admitting it. Instead, blue chip companies fight to steer OpenStack in the direction that suits them and their corporate offerings.

What’s great about SmartDataCenter?

It works.

The keys are in the ignition. You should probably stop reading this article and install it already. You are likely to get a promotion for listening to me 😉

Great Technology

SmartDataCenter was built on really great technologies like SmartOS (a descendant of Solaris by way of illumos), Zones, ZFS, and DTrace. Most of these technologies are slowly being ported to Linux, but they are already 10 years mature in SDC.

  • Being based on a fork of Solaris brings you baked-in, enterprise-ready features like IPsec, IPF, RBAC, SMF, resource management and capping, system auditing, filesystem monitoring, etc.
  • Zones are the big daddy of container technology, guaranteeing you the best on-metal performance for your cloud instances. If you are running a native SmartOS guest, you get the added benefit of CPU bursting and live machine resize (no reboot or machine pause necessary).
  • ZFS is the most reliable, high performance, file system in the world and is constantly improving.
  • DTrace is the secret to low-level visibility with little to no overhead. In cloud deployments, where visibility is usually close to zero, this is an amazing feature. It’s even more amazing as the cloud operator.

Focus

SDC was built for one thing by one company: to replace the data centers of the past. It says so in the name. With one purpose, SDC has been built to be very opinionated about what it does and how it does it. This gives SDC a tremendous amount of focus, something sorely lacking from would-be competition like OpenStack.

Lastly, it works.