Category: Storage

Virtual Block Storage Crashed Your Cloud Again :(

darthvader

You know it’s bad when you start writing an incident report with the words “The first 12 hours.” You know you need a stiff drink, possibly a career change, when you follow that up with phrases like “this was going to be a lengthy outage…”, “the next 48 hours…”, and “as much as 3 days”.

That’s what happened to huge companies like Netflix, Heroku, Reddit, Hootsuite, Foursquare, Quora, and Imgur the week of April 21, 2011. Amazon AWS went down for over 80 hours, leaving them and others up a creek without a paddle. The root cause of this cloud-tastrify echoed loud and clear.

Heroku said:

The biggest problem was our use of EBS drives, AWS’s persistent block storage solution… Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can.

Reddit said:

Amazon had a failure of their EBS system, which is a data storage product they offer, at around 1:15am PDT. This may sound familiar, because it was the same type of failure that took us down a month ago. This time however the failure was more widespread and affected a much larger portion of our servers

While most companies made heartfelt resolutions to get off of EBS, Netflix made it clear that they never trusted EBS to begin with:

When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service…

Fool me once…

As Reddit mentioned in their postmortem, AWS had similar EBS problems twice before, on a smaller scale, in March. After an additional 80+ hours of downtime, you would expect companies to learn their lesson, but the fact is that these same outages continue to plague clouds built on various types of virtual block storage.

In July 2012, AWS experienced a power failure which resulted in a huge number of possibly inconsistent EBS volumes and an overloaded control plane. Some customers saw almost 24 hours of downtime.

Heroku, under the gun again, said:

Approximately 30% of our EC2 instances, which were responsible for running applications, databases and supporting infrastructure (including some components specific to the Bamboo stack), went offline…
A large number of EBS volumes, which stored data for Heroku Postgres services, went offline and their data was potentially corrupted…
20% of production databases experienced up to 7 hours of downtime. A further 8% experienced an additional 10 hours of downtime (up to 17 hours total). Some Beta and shared databases were offline for a further 6 hours (up to 23 hours total).

AppHarbor had similar problems:

EC2 instances and EBS volumes were unavailable and some EBS volumes became corrupted…
Unfortunately, many instances were restored without associated EBS volumes required for correct operation. When volumes did become available they would often take a long time to properly attach to EC2 instances or refuse to attach altogether. Other EBS volumes became available in a corrupted state and had to be checked for errors before they could be used.
…a software bug prevented this fail-over from happening for a small subset of multi-az RDS instances. Some AppHarbor MySQL databases were located on an RDS instance affected by this problem.

The saga continued for AWS, which had more problems with EBS later in 2012. Their postmortem details, ad nauseam, how a small DNS misconfiguration triggered a memory leak which caused a silent, cascading failure of all the EBS servers. As usual, the EBS failures impacted API access and RDS services, and yet again Multi-AZ RDS instances didn’t fail over automatically.

Who’s using Virtual Block Storage?

Amazon EBS is just one very common example of Virtual Block Storage, and by no means the only one to fail miserably.

Azure stores the block devices for all their compute nodes as Blobs in their premium or standard storage services. Back in November, a bad update to the storage service sent some of their storage endpoints into infinite loops, denying access to many of these virtual hard disks. The bad software was deployed globally and caused more than 10 hours of downtime across 12 data centers. According to the post, some customers were still being affected as much as three days later.

HP Cloud provides virtual block storage based on OpenStack Cinder, with its own string of related incident reports. I could keep going back in time, but I think you get the point.

Also based on Cinder, Rackspace offers their Cloud Block Storage product. Their solution has some proprietary component they call Lunr, as detailed in a Hacker News thread, so you can hope that Lunr is more reliable than other implementations. Still, Rackspace had major capacity issues spanning over two weeks back in May of last year, and I shudder to think what would have happened if anything had gone wrong while capacity was low.

Storage issues are so common and take so long to recover from in OpenStack deployments, that companies are deploying alternate cloud platforms as a workaround while their OpenStack clouds are down.

What clouds won’t ruin your SLA?

Rackspace doesn’t force you to use their Cloud Block Storage, at least not yet, so unless they are drinking their own kool-aid in ways they shouldn’t be, you are hopefully safe there.

Digital Ocean also relies on local block storage by design. They are apparently considering other options but want to avoid an EBS-like solution for the reasons I’ve mentioned. While their local storage isn’t putting you at risk of a cascading failure, they have been reported to leak your data to other customers if you don’t destroy your machines carefully. They also have other fairly frequent issues which take them out of the running for me.

The winning horse

As usual, Joyent shines through on this. For many reasons, the SmartDataCenter platform, behind both their public cloud and open source private cloud solutions, supports only local block storage. For centralized storage, you can use NFS or CIFS if you really need to, but you will not find virtual block storage or even SAN support.

Joyent gets some flak for this opinionated architecture, occasionally even from me, but they don’t corrupt my data or crash my servers because some virtual hard disk has gone away or some software upgrade has been foolishly deployed.

With their recently released Docker and Linux Binary support, Joyent is really leading the pack with on-metal performance and availability. I definitely recommend hitching your wagon to their horse.

The Nooooooooooooooo! button

If it’s too late and you’re only finding this article post-cloud-tastrify, I refer you to the ever amusing Nooooooooooooooo! button for some comic relief.

EMC Fully Automated Storage Tiering

Storage Tiering is nothing new. We use fast 15K RPM disks for high performance applications, slower 10K RPM disks for less demanding applications, and 7.2K RPM SATA disks for archive storage. Recently, solid state disks (SSDs) have also become more common for really high performance needs. The trick is managing it all.

Two or three years ago, if you wanted to implement automatic storage tiering, I would have pointed you in the direction of Sun’s Storage and Archive Manager (SAM) and QFS, Sun’s tightly integrated shared file system. SAM-QFS automatically moves files from one storage tier to another based on the SAM policy and transparently retrieves the files when requested. With tape still the least expensive storage available, this is still a great solution for archiving petabytes of documents/files.

Unfortunately, SAM works at the file level so it will not help our databases run faster. What will help us is ZFS. ZFS is still making some fairly big waves in the storage community with its Hybrid Storage Pool feature. In a standard configuration, ZFS uses RAM for a first-level read cache (ARC). In advanced configurations, the zpool can be configured to use a second-level cache (L2ARC) on faster devices, i.e. SSDs as compared to SAS or SATA disks. The zpool can also be configured to use separate, possibly faster, disks for the ZFS Intent Log (ZIL), which is basically a write cache (without getting into why it is more than a write cache). Even without faster disks, the ability to store the read/write cache on a separate device can increase performance just by dedicating more IOPS to the cause.
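
To make the moving parts concrete, here is a minimal sketch of building such a hybrid pool from the command line, assuming hypothetical Solaris device names (c0t0d0 and friends) with the SSDs at c0t2d0 and c0t3d0:

    # Create a mirrored pool on ordinary spinning disks (device names are placeholders).
    zpool create tank mirror c0t0d0 c0t1d0

    # Put the ZFS Intent Log (ZIL) on a separate, faster device to absorb synchronous writes.
    zpool add tank log c0t2d0

    # Add an L2ARC cache device to extend the RAM-based ARC with a second-level read cache.
    zpool add tank cache c0t3d0

    # The "logs" and "cache" sections should now show up in the pool layout.
    zpool status tank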

Oracle/Sun’s 7000 series storage builds on the success of the ZFS Hybrid Storage Pool, using Logzilla devices for the ZIL and Readzilla devices for the L2ARC. With the powerful flash acceleration in the storage pool, even 7.2K RPM disks can give performance equal to that of higher speed 15K RPM disks.

Although ZFS does great things for performance by utilizing multiple tiers of storage devices, all the data is still physically stored on the same tier of storage in addition to having the hot data stored again in the caches. This is arguably a waste of capacity but can also lead to performance issues in some cases. For example, a cold L2ARC cache after reboot could give slower performance until fully warmed up. Oracle will probably fix this at some point by allowing the L2ARC to persist if stored on a non-volatile device (bug_id=6662467).

In the meantime, EMC recently announced an interesting new feature called FAST, short for Fully Automated Storage Tiering. FAST is available from FLARE version 04.30.000.5.004. FAST allows you to define a pool in the array composed of multiple RAID Groups, and then define a LUN on the pool as opposed to defining a LUN on the RAID Groups themselves. Once the LUN begins filling with data, the EMC will begin transparently migrating data between the tiers of the pool in 1GB chunks, storing hot data on the fastest tiers and the coldest data on the slowest tier.

FAST sounds like a dream come true. No more complicated storage configurations for the database. No more packages and processes to move historical data to slower disk groups. On the other hand, I am skeptical as to whether or not this technology is really mature. Do all EMC products treat FAST LUNs the same as traditional LUNs (SnapView, Replication Manager, etc.)? Also, are the ramifications of disk failures for a FAST LUN the same, or does failure of a Tier 1 disk in a FAST pool mean a lot more high performance eggs in one basket? Time will tell.

No ZFS Support for EMC Replication Manager

As I originally blogged, I was hoping to use EMC snapshots to perform server-less/network-less backups. EMC provides two main tools for managing snapshots in this type of situation:

  • EMC Replication Manager
  • EMC PowerSnap Networker Module

The PowerSnap Module supposedly automates taking snapshots for the purpose of backups, while Replication Manager supposedly provides a much more robust package.

With Replication Manager you might create a policy to take a snapshot every five minutes, keep the last 10, and use those for backups whenever necessary.

To make a long story short, Replication Manager is useless for LUNs with ZFS. According to EMC, this won’t change in the near future. PowerSnap also has no support for taking snapshots of LUNs with ZFS on them so basically EMC has no server-less backup offerings for Solaris with ZFS.

As an IT guy in general, I think ZFS is the best thing that has happened to file systems in the last 10 years, and it is only getting better. ZFS is already standard in FreeBSD and NetBSD. Linux supports ZFS over FUSE due to license issues, but I’m confident those will be solved. The file system is platform independent, meaning you can move the data transparently between Intel and SPARC architectures. Deduplication has just been added to the feature set, and disk encryption is on its way.

As a Solaris admin, I really can’t figure out why EMC would decide to cut off their own foot like this. It is clear that UFS will remain for legacy and backwards compatibility but ZFS is the future. Not planning to support ZFS is like not planning to support Solaris.

The only possibility that I can see is that EMC sees Sun, Solaris, and ZFS as enough of a threat that they are strategically trying to limit options. For operations local to a server, ZFS has largely replaced the need for heavy hardware like EMC on the SAN. Some would argue that ZFS RAID + JBOD is better than ZFS + RAID on EMC. You can do the snapshots without the EMC. On a simple level, you can send snapshots asynchronously to another system, similar to MirrorView, without the EMC. You can do deduplication without the EMC. Now with Sun’s Flash Cache technology, which integrates with ZFS, you can get the performance without the EMC. Along the same lines, you see Sun changing the rules of the storage/database game with solutions like Exadata V2. The integration of Zones with ZFS may be challenging VMware on the virtualization front, especially with the serious advantage Sun’s CoolThreads servers have in terms of consolidation.
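
As a rough sketch of that MirrorView-style replication using nothing but ZFS itself (the dataset tank/data and the host backuphost are hypothetical):

    # Take an initial snapshot and ship the full stream to the remote pool.
    zfs snapshot tank/data@rep-1
    zfs send tank/data@rep-1 | ssh backuphost zfs receive -F backup/data

    # Later, send only the blocks that changed since the previous snapshot.
    zfs snapshot tank/data@rep-2
    zfs send -i tank/data@rep-1 tank/data@rep-2 | ssh backuphost zfs receive backup/data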

That said, I still prefer to offload this work to dedicated storage hardware for the time being and probably in the future. If EMC chooses not to support ZFS, they will only force us not to buy EMC arrays. We will stop buying disks, stop buying tools, etc.

Instead, they should be providing better support for ZFS, integrating with ZFS to get better performance, and providing tools which make EMC the preferred disk array behind a ZFS filesystem.

EMC Replication Manager in Solaris

UPDATE: No ZFS Support for Replication Manager in the near future

Storage-level snapshots can be used to run backups without directly requiring resources from the original host.

EMC Replication Manager coordinates the creation of application-consistent snapshots across all the hosts in your network. It handles scheduling creation/expiration of snapshots, mounting and unmounting them from backup servers, etc., from a single console.

Although it is not tightly integrated into EMC Networker like the similar Networker PowerSnap module, it can be used to start a backup process after taking a new snapshot and it has the capability to manage snapshots unrelated to backups from a GUI.

While the data sheet claims support for Solaris, there are several caveats which I have run into.

  1. There is no mention of ZFS support in the data sheet and, apparently, there is no support in the software either. One would expect this to be a non-issue since ZFS has been part of Solaris since 2006.
  2. The data sheet is missing the word “SPARC” next to the word Solaris. There is no support for x86.

Honestly, this has put a dent in my plans since my backup server is an x86 box. I’m hoping the lack of ZFS support will work out as long as we can script any FS-specific magic we need. I don’t have the option of running something like Linux on it (just to get the software working) because then I won’t be able to even mount the ZFS filesystems, let alone back them up.

In the meantime, I’ll have to move my backups to a SPARC server and considering the lack of low end SPARC machines, I’ll have to allocate something way too expensive to be a backup server.

Caveats on Using Snapshots for Server-less Backups

Whether you are dealing with disk I/O in reading the data from the disks, CPU for compressing or encrypting the data (or both- remember to compress and then encrypt!), or network bandwidth for transferring the data to a backup server, the added load of a backup on your production servers is unwelcome. For this reason, the period of time during which backups can be made, a.k.a. the backup window, may be limited- even severely.
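
As a toy illustration of why the order matters, here is a minimal sketch with hypothetical paths and key file; encrypting first would leave the compressor with what looks like random noise, so compress before you encrypt:

    # Compress first, then encrypt; reversing the order wastes CPU for little gain.
    tar cf - /data \
        | gzip \
        | openssl enc -aes-256-cbc -salt -pass file:/root/backup.key \
        > /backup/data.tar.gz.enc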

You may say, “It only takes me X hours to do a full backup of everything”, but over time backup windows are notorious for becoming too small. Backups get split over multiple days, technologies get upgraded, etc. When planning a backup strategy, my approach is to eliminate the backup window altogether- that is, do whatever you can to take the backup off the production hardware entirely.

Storage snapshots are one method for taking the production servers out of the backup equation. By creating a consistent, point-in-time snapshot on your storage and mounting it on your backup server, you can back up your data using your backup server’s resources while your production servers continue as usual.

Caveats of this method in general are:

  1. Most snapshot technologies are some form of “Copy On Write”. This means that after you take a snapshot, the data from any area written to on the disks will first be copied somewhere else for safe keeping and then be overwritten.
    • This may cause a performance hit on your production system as you are generating extra IO on every write.
    • As long as the data being used in production has not changed significantly from the snapshot, your backups will still be sending the majority of their read operations to the same physical disks being used by production so this doesn’t relieve the backup load on the storage as much as it relieves the load on the servers.
  2. Key word is “consistent”.
    • You do not want to be where the KDE developers were when ext4 was released. Depending on the applications or systems you are trying to back up, you may need to quiesce them (FLUSH TABLES WITH READ LOCK, ALTER TABLESPACE <tablespacename> BEGIN BACKUP, etc.; see the sketch after this list).
    • If your application, e.g. an Oracle database, uses datafiles or ASM spread over several LUNs, then all your storage-level snapshots probably need to be taken together in order for the DB itself to remain consistent. For more, look at “Consistency Groups.”
  3. Once you have the snapshot, your backup server needs to see the snapshot LUN and be able to mount the filesystem on the LUN. If your backup server doesn’t run the same operating system as your production servers, this may be an issue, e.g. try convincing a Windows server to mount a ZFS pool (I dare you).
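
To tie the caveats together, here is a rough sketch of the whole flow for an Oracle database whose datafiles sit on ZFS, with hypothetical names throughout; the array snapshot step is deliberately a placeholder for whatever your storage vendor (or Replication Manager) provides:

    # 1. Quiesce: put the tablespace into hot-backup mode (this state persists
    #    across sessions, unlike a MySQL read lock).
    sqlplus -s "/ as sysdba" <<'EOF'
    ALTER TABLESPACE users BEGIN BACKUP;
    EOF

    # 2. Snapshot every LUN the datafiles live on, together (consistency group).
    #    Placeholder: substitute your array's CLI or Replication Manager job here.

    # 3. Take the tablespace back out of backup mode.
    sqlplus -s "/ as sysdba" <<'EOF'
    ALTER TABLESPACE users END BACKUP;
    EOF

    # 4. On the backup server, present the snapshot LUN and mount its filesystem,
    #    e.g. import the ZFS pool under an alternate root, then run the backup there.
    zpool import -R /mnt/snap -f tank

The important details are that the BEGIN BACKUP state outlives the session that set it, and that the import on the backup server happens against the snapshot copy, never the production LUN.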

Anyway- these are just some things to look out for when you want to use storage level snapshots to back up servers without loading the production systems themselves. In another post I’ll touch on some EMC infrastructure specifics to watch for.