Tag: clariion

EMC Fully Automated Storage Tiering

Storage Tiering is nothing new. We use fast 15K RPM disks for high performance applications, slower 10K RPM disks for less demanding applications, and 7.2K RPM SATA disks for archive storage. Recently, solid state disks (SSDs) have also become more common for really high performance needs. The trick is managing it all.

Two or three years ago, if you wanted to implement automatic storage tiering, I would have pointed you in the direction of Sun’s Storage and Archive Manager (SAM) and QFS, Sun’s tightly integrated shared file system. SAM-QFS automatically moves files from one storage tier to another based on SAM policies and transparently retrieves the files when requested. With tape still the least expensive storage available, it remains a great solution for archiving petabytes of documents and files.

Unfortunately, SAM works at the file level so it will not help our databases run faster. What will help us is ZFS. ZFS is still making some fairly big waves in the storage community with its Hybrid Storage Pool feature. In a standard configuration, ZFS uses RAM for a first-level read cache (ARC). In advanced configurations, the zpool can be configured with a second-level cache (L2ARC) on faster devices (e.g. SSDs as opposed to SAS or SATA disks). The zpool can also be configured to use separate, possibly faster devices for the ZFS Intent Log (ZIL), which is basically a write cache (without getting into why it is more than a write cache). Even without faster devices, the ability to put the read/write cache on a separate device can increase performance just by dedicating more IOPS to the cause.
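
To get a feel for why the separate cache devices matter, here is a rough back-of-the-envelope sketch in Python. The hit rates and latencies are assumptions for illustration only, not measurements from any particular system:

# Rough illustration only: assumed latencies and hit rates, not measurements.
ram_arc_latency_ms   = 0.001   # read served from the ARC in RAM (assumed)
ssd_l2arc_latency_ms = 0.2     # read served from an L2ARC on SSD (assumed)
sata_disk_latency_ms = 12.0    # read served from a 7.2K RPM SATA disk (assumed)

def avg_read_latency(arc_hit, l2arc_hit):
    """Weighted average read latency for the given cache hit ratios."""
    disk_ratio = 1.0 - arc_hit - l2arc_hit
    return (arc_hit    * ram_arc_latency_ms +
            l2arc_hit  * ssd_l2arc_latency_ms +
            disk_ratio * sata_disk_latency_ms)

print(avg_read_latency(0.30, 0.00))  # ARC only:            ~8.4 ms average read
print(avg_read_latency(0.30, 0.50))  # ARC plus L2ARC hits:  ~2.5 ms average read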

Oracle/Sun’s 7000 series storage builds on the success of the ZFS Hybrid Storage Pool, using Logzilla devices for the ZIL and Readzilla devices for the L2ARC. With the powerful flash acceleration in the storage pool, even 7.2K RPM disks can give performance equal to that of higher speed 15K RPM disks.

Although ZFS does great things for performance by utilizing multiple tiers of storage devices, all the data is still physically stored on the same tier of storage, in addition to having the hot data stored again in the caches. This is arguably a waste of capacity but can also lead to performance issues in some cases. For example, a cold L2ARC after a reboot can mean slower performance until the cache warms up again. Oracle will probably fix this at some point by allowing the L2ARC to persist if stored on a non-volatile device (bug_id=6662467).

In the meantime, EMC recently announced an interesting new feature called FAST, short for Fully Automated Storage Tiering. FAST is available as of FLARE version 04.30.000.5.004. FAST allows you to define a pool in the array composed of multiple RAID Groups, and then define a LUN on the pool as opposed to defining a LUN on the RAID Groups themselves. Once the LUN begins filling with data, the array will transparently begin migrating data between the tiers of the pool in 1GB chunks, storing the hottest data on the fastest tier and the coldest data on the slowest tier.

FAST sounds like a dream come true. No more complicated storage configurations for the database. No more packages and processes to move historical data to slower disk groups. On the other hand, I am skeptical as to whether or not this technology is really mature. Do all EMC products treat FAST LUNs the same as traditional LUNs (SnapView, Replication Manager, etc.)? Also, are the ramifications of disk failures the same for a FAST LUN, or does failure of a Tier 1 disk in a FAST pool mean a lot more high-performance eggs in one basket? Time will tell.

EMC Replication Manager in Solaris

UPDATE: No ZFS Support for Replication Manager in the near future

Storage-level snapshots can be used to run backups without directly requiring resources from the original host.

EMC Replication Manager coordinates the creation of application-consistent snapshots across all the hosts in your network. It handles scheduling the creation and expiration of snapshots, mounting and unmounting them on backup servers, etc., all from a single console.

Although it is not tightly integrated into EMC Networker like the similar Networker PowerSnap module, it can be used to start a backup process after taking a new snapshot, and it can manage snapshots unrelated to backups from a GUI.

While the data sheet claims support for Solaris, there are several caveats which I have run into.

  1. There is no mention of ZFS support in the data sheet and apparently, there is no support in the software either. One would expect this to be a non-issue since ZFS has been part of Solaris since 2006.
  2. The data sheet is missing the word “SPARC” next to the word Solaris. There is no support for x86.

Honestly, this has put a dent in my plans since my backup server is an x86 box. I’m hoping the lack of ZFS support will work out as long as we can script any FS-specific magic we need. I don’t have the option of running something like Linux on it (just to get the software working) because then I won’t be able to even mount the ZFS filesystems, let alone back them up.

In the meantime, I’ll have to move my backups to a SPARC server and, considering the lack of low-end SPARC machines, I’ll have to allocate something way too expensive to be a backup server.

When 99.999% Isn’t Good Enough

When discussing availability of a service, it is common to hear the term “Five Nines”, referring to a service being available 99.999% of the time, but “Five Nines” is relative. If your time frame is a week, then your service can be unavailable for 6.05 seconds, whereas a time frame of a year allows for a very respectable 5.26 minutes.

In reality, none of those calculations are relevant because no one cares if a service is unavailable for 10 hours, as long as they aren’t trying to use it. On the other hand, if you’re handling 50,000 transactions per second, 6.05 seconds of unavailability could cost you 302,500 transactions, and no one will care that you met your SLA.
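
The arithmetic is simple enough to script. A minimal sketch of the downtime budget, using the 50,000 transactions per second figure from the example above:

# Allowed downtime and lost transactions for a given availability target.
def downtime_seconds(availability, period_seconds):
    return (1.0 - availability) * period_seconds

week = 7 * 24 * 60 * 60        # 604,800 seconds
year = 365 * 24 * 60 * 60      # 31,536,000 seconds

print(downtime_seconds(0.99999, week))        # ~6.05 seconds per week
print(downtime_seconds(0.99999, year) / 60)   # ~5.26 minutes per year

tps = 50_000                                  # transactions per second (example above)
print(downtime_seconds(0.99999, week) * tps)  # ~302,400 lost transactions
                                              # (302,500 above uses the rounded 6.05 s)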

This is a problem I’ve come up against a number of times in the past, and even more so recently. The issue is one of orders of magnitude in IT: the larger the volume of business you handle, the less relevant the Five Nines become.

Google became famous years ago for its novel approach to hardware availability. They were using servers and disks on such a scale that they could no longer prevent the failures and they decided not to even try. Instead, they planned to sustain lots of failures and made a business of knowing when to expect problems and where. As much as we would like to be able to take Google’s approach to things, I think most of our IT budgets aren’t up for it.

Another good example is EMC, who boast 99.999% availability for their Clariion line of storage systems. I want to start by saying that I use EMC storage and I’m happy with them. Regardless, their claim of 99.999% availability doesn’t give me any comfort, for the following reasons.

According to a whitepaper from 2007 (maybe they have changed things since then), EMC has a team which calculates availability for every Clariion in the field on a weekly basis. Assuming there were 2000 Clariion systems in the field in a given week (the example given in the whitepaper), with a total of 1.5 hours of downtime across all of them, then:

2000 systems x 7 days x 24 hours   =  336,000 total hours of runtime
336,000 hours - 1.5 hours downtime =  335,998.5 hours of uptime
335,998.5 / 336,000                =  99.9996% uptime

That is great, at least that is what EMC wants you to think. I look at this and understand something totally different. According to this guy, as of the beginning of 2009 there were 300,000 Clariions sold, not 2000. That is two orders of magnitude more, which means:

300,000 systems x 7 days x 24 hours    =  50,400,000 total hours of runtime
50,400,000 hours - 504 hours downtime  =  50,399,496 hours of uptime
50,399,496 / 50,400,000                =  99.999% uptime

Granted, that is a lot of uptime, but 504 hours of downtime is still 21 full days of downtime for someone. If it were possible for 21 full days of downtime to fit in one week, they could all be yours and EMC would still be able to claim 99.999% availability according to their calculations. By the same token, 3 EMC customers each week could theoretically have no availability the entire week, and one of those customers could be me.

Since storage failures can cause so many complications, I figure it is much more likely that EMC downtime comes in days as opposed to minutes or hours. Either way, Five Nines is lost in the scale of things in this case as well.

Content Delivery Networks provide another availability vs. scale problem. Akamai announced record-breaking amounts of traffic on their network in January 2009. They passed 2 terabits per second and 12,000,000 requests per second. (I don’t use Akamai but I think it is amazing that they delivered over 2 terabits/second of traffic.) With that level of traffic, even if Akamai provided a 99.999% availability SLA, they could have had 120 failed requests per second, 7,200 failed requests per minute, etc.

Sometimes complaints relating to our CDN cross my desk, and while I have no idea how much traffic our CDN handles worldwide, I know that we can easily send it 20,000,000 requests per day. Assuming 99.999% availability, I expect (learning from Google) to have 200 failed requests per day. Knowing IT as I do, I also expect that all 200 failed requests will be in the same country, probably an issue with one of their cache servers which, due to GTM, will primarily affect people directed to that server, etc. Unfortunately, the issue of scale is lost on our partners who didn’t get their content.

Availability is not the only case where scale is forgotten. I was recently asked to help debug the performance of an application server which could handle a large number of requests per second when queried directly, but only handled 80% of that rate when sitting behind a load balancer.

Of course, we started by trying to find a reason why the load balancer would be causing a 20% performance hit. After deep investigation, the answer I found (not necessarily the correct answer) was that all the load balancing configurations were correct and, on average, having the load balancer in the path added 1 millisecond to the response time of each request. Unfortunately, the average response time without the load balancer was 4 milliseconds, so going from 4 to 5 milliseconds per request reduced the overall throughput by 20%.
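
To see why 1 millisecond matters so much here, assume the benchmark client issues requests serially, so throughput is simply the reciprocal of the response time. A minimal sketch with the numbers from this case:

# Serial client: requests per second = 1000 ms / response time in ms.
direct_ms    = 4.0               # average response time hitting the server directly
behind_lb_ms = direct_ms + 1.0   # the load balancer adds ~1 ms in the path

direct_rps = 1000.0 / direct_ms        # 250 requests/second per client thread
lb_rps     = 1000.0 / behind_lb_ms     # 200 requests/second per client thread

print(lb_rps / direct_rps)             # 0.8, i.e. the observed 20% drop in throughput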

In short, everything is relative and 99.999% isn’t good enough.

RAID 10 vs RAID 5: Performance, Cost, Space, and HA

DISCLAIMER: I am not a SAN storage expert but I have spent a lot of time looking into SAN storage systems from the business side and I thought I’d share some of my conclusions.

It seems that the proverbial question is how to balance the performance, cost, usable space, and availability of a storage solution. Any DBA will ask you to give him RAID 10 on small fast disks. Anyone paying the bills will ask “Why can’t I use half the disks I bought?”

I took a couple of hours with your friendly neighborhood spreadsheet and did the math. I based my calculations on EMC Clariion storage and tried to follow the EMC best practices guide as much as possible.

According to the best practices, I started my calculations from a required performance level consisting of total IOPS, read percentage, and write percentage.
Then, using the following formulas, I calculate the actual disk IOPS required to provide the requested performance:

  • RAID 5 (4+1 Groups)
    Disk IOPS = (Read % * Required IOPS) +
                    (Write % * RAID5 write penalty * Required IOPS)
  • RAID 10
    Disk IOPS = (Read % * Required IOPS) +
                    (Write % * RAID10 write penalty * Required IOPS)

The RAID 5 write penalty in a 4+1 RAID group is 4 while the RAID 10 write penalty is 2.
Before you even put this in a spreadsheet, you know what it will tell you:

  • In a 100% Read Only environment, RAID 5 and RAID 10 will give the same performance. RAID 5 may use fewer disks to do it, but not necessarily.
  • In a 100% Write Only environment, RAID 5 will require twice as many disk IOPS and almost twice the number of disks.
  • Anywhere in between those two extremes, the more writes required, the fewer RAID 10 disks you will need to achieve the same performance.
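
For reference, the formulas above boil down to a few lines of Python. This is just a sketch of the same arithmetic; the 70/30 read/write mix in the example is mine, not a figure from the original spreadsheet:

# Back-end disk IOPS needed to deliver a host workload, per the formulas above.
def disk_iops(required_iops, read_pct, write_penalty):
    write_pct = 1.0 - read_pct
    return (read_pct * required_iops) + (write_pct * write_penalty * required_iops)

def raid5_disk_iops(required_iops, read_pct):
    return disk_iops(required_iops, read_pct, write_penalty=4)  # 4+1 RAID 5

def raid10_disk_iops(required_iops, read_pct):
    return disk_iops(required_iops, read_pct, write_penalty=2)  # mirrored pairs

# 2000 host IOPS at a 70/30 read/write mix:
print(raid5_disk_iops(2000, 0.7))    # 3800 back-end disk IOPS
print(raid10_disk_iops(2000, 0.7))   # 2600 back-end disk IOPS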

If we stop there, it doesn’t seem like there is any point in using RAID 5 since, even in the best-case scenario, there is only a partial chance that we will use fewer disks. That is where the cost and space effectiveness issues come in.

  • Space Effective Storage Allocation

If I want 2000 IOPS, 100% Read Only, I can do that using 15 x 146GB 15k RPM disks in RAID 5 or in RAID 10. In RAID 5 I will get ~1.5TB net space while in RAID 10 I will get ~1TB.

  • Cost Effective Storage Allocation

So far, we have compared different RAID types using the same size and speed disks, and we saw that theoretically we can use fewer disks to reach the same performance, but at the expense of usable disk space.

If we use bigger disks for the RAID 10, does it make up for the lost space? What effect does using RAID 10 with fewer large disks as opposed to RAID 5 with lots of smaller disks have on the cost of my solution?

That brings us back to the spreadsheet. Using the required disk IOPS we can figure out the required number of physical disks of each type. For the sake of comparison I use the following information which I found on the Internet (your mileage may vary):

  • 146GB 4GbFC 15k RPM, 140 IOPS, $1256
  • 300GB 4GbFC 10k RPM, 120 IOPS, $1348
  • 1TB 4Gb SATA II 7.2k RPM, 80 IOPS, $2088

For each of these I calculate the minimum number of physical disks required to reach the required IOPS with the required read/write profile for both RAID 10 and RAID 5. Then I figure in the RAID group sizes and calculate the usable disk space.
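
As an illustration of that step, here is a rough sketch of the disk-count and price-per-TB arithmetic using the list prices above. It assumes raw capacities, whole RAID groups, no hot spares or formatting overhead, and an arbitrary 70/30 read/write mix, so it will not reproduce the exact figures below:

import math

# (capacity TB, IOPS per disk, price per disk) from the list above
disks = {
    '146GB 15k': (0.146, 140, 1256),
    '300GB 10k': (0.300, 120, 1348),
    '1TB 7.2k':  (1.000,  80, 2088),
}

def cost_per_tb(disk, required_iops, read_pct, raid10=False):
    capacity, iops, price = disks[disk]
    penalty = 2 if raid10 else 4                      # RAID 10 vs RAID 5 (4+1)
    backend = read_pct * required_iops + (1 - read_pct) * penalty * required_iops
    n = math.ceil(backend / iops)                     # minimum physical disks
    if raid10:
        n += n % 2                                    # mirrored pairs need an even count
        usable = n / 2 * capacity
    else:
        n = math.ceil(n / 5) * 5                      # whole 4+1 RAID groups
        usable = n * 4 / 5 * capacity
    return n * price / usable                         # dollars per usable TB

print(cost_per_tb('146GB 15k', 2000, 0.7))               # RAID 5 (4+1) on 146GB 15k disks
print(cost_per_tb('300GB 10k', 2000, 0.7, raid10=True))  # RAID 10 on 300GB 10k disks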

Using the prices above, I calculate the price per TB of disk space in each RAID configuration and find:

  • 146GB, RAID 5 (4+1): $11.91K/TB
  • 300GB, RAID 5 (4+1): $6.35K/TB
  • 1TB, RAID 5 (4+1): $2.87K/TB
  • 146GB, RAID 10: $19.01K/TB
  • 300GB, RAID 10: $10.15K/TB
  • 1TB, RAID 10: $4.59K/TB

What is really interesting here is how close the 300GB RAID 10 is to the 146GB RAID 5! Is this a coincidence?

Looking at the IOPS/TB relationship and $K/IOPS, we find that the ratios are dependent on the read/write profile of the required IOPS. Given the similar price/TB of 300GB RAID 10 and 146GB RAID 5, I look there for a price/performance/disk space sweet spot.

The following table shows the difference between 146GB RAID 5 IOPS/TB and 300GB RAID 10 IOPS/TB.
Each column represents a different Read percentage (the Write percentage is the remainder).
Negative numbers mean that for that Read percentage and IOPS requirement, RAID 10 gives more IOPS/TB of disk. Positive numbers mean that RAID 5 gives more IOPS/TB.
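
The table itself is just more spreadsheet output. A minimal sketch of how that comparison can be generated, using the same simplified disk-count arithmetic as the sketch above (so the exact break-even point may differ slightly from the original table):

import math

# IOPS per usable TB: 146GB 15k RAID 5 (4+1) vs 300GB 10k RAID 10.
def iops_per_tb(required_iops, read_pct, disk_tb, disk_iops, raid10):
    penalty = 2 if raid10 else 4
    backend = read_pct * required_iops + (1 - read_pct) * penalty * required_iops
    n = math.ceil(backend / disk_iops)
    if raid10:
        n += n % 2                              # mirrored pairs need an even disk count
        usable = n / 2 * disk_tb
    else:
        n = math.ceil(n / 5) * 5                # whole 4+1 RAID groups
        usable = n * 4 / 5 * disk_tb
    return required_iops / usable

for read_pct in (0.5, 0.6, 0.7, 0.8, 0.9):
    r5  = iops_per_tb(2000, read_pct, 0.146, 140, raid10=False)
    r10 = iops_per_tb(2000, read_pct, 0.300, 120, raid10=True)
    print(read_pct, round(r5 - r10))            # positive means RAID 5 gives more IOPS/TB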

What you see from this is that for any read workload under 70%, you will get more IOPS/TB from 300GB 10k RPM disks using RAID 10 than you will with RAID 5 on 146GB 15k RPM disks.
Even if you hit 80%, RAID 5 will gain less than 100 IOPS over the RAID 10 configuration and you are still better off paying less for your disks; let the cache do its job. Combine all this with our previous conclusion, that the 300GB RAID 10 configuration is ~$1.75K less expensive per TB, and I say you have a winner.