Vendor Lock-In or One Stop Shop

I was recently discussing load balancers with someone. I said I was much happier with F5 than with Cisco, and he countered that although he preferred F5 head to head, going with Cisco for the entire network was better for them in the long run.

The situation with storage is similar. EMC makes a great SAN but a pretty bad NAS. Is it worth getting EMC’s NAS for the One Stop Shop factor?

Since Oracle’s acquisition of Sun, I’ve been looking forward to the success of their “One Stop Shop” philosophy. Successfully bringing all their offerings under one roof promises better and faster support all around.

Unfortunately, it has been almost a year and Oracle still hasn’t figured out how to unify the customer support systems. New support contracts don’t work in either system. To make things even less clear, Oracle recently announced that everything will be migrated to “My Oracle Support” but they don’t know when. Very reassuring.

A simple pattern emerges. One Stop Shop is a dream for IT people. Support is hard enough to get when you’ve isolated a problem to a specific vendor. It is even harder when your problems are between two vendors and each points the finger at the other.

When does the One Stop Shop strategy become a rationalization for Vendor Lock-In? It is a delicate balance: how much better your IT could be with Best of Breed versus how much worse off you will be integrating all the different pieces of the puzzle yourself.

Regarding Cisco vs. F5, I’m also pretty happy letting Cisco handle everything Layer 3 and under, and I don’t worry too much about the integration. I’m also optimistic regarding Sun and Oracle. I think they’ll have the wrinkles ironed out by the second half of 2011. If they don’t, it will be a serious letdown.

EMC Fully Automated Storage Tiering

Storage Tiering is nothing new. We use fast 15K RPM disks for high performance applications, slower 10K RPM disks for less demanding applications, and 7.2K RPM SATA disks for archive storage. Recently, solid state disks (SSDs) have also become more common for really high performance needs. The trick is managing it all.

Two or three years ago, if you wanted to implement automatic storage tiering, I would have pointed you in the direction of Sun’s Storage and Archive Manager (SAM) and QFS, Sun’s tightly integrated shared file system. SAM-QFS automatically moves files from one storage tier to another based on the SAM policy and transparently retrieves the files when requested. With tape still the least expensive storage available, this is still a great solution for archiving petabytes of documents and files.
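
For the curious, those policies live in archiver.cmd. The sketch below is from memory and purely illustrative (the set name, ages, and media type are made up), so check the SAM-QFS documentation for the exact syntax:

# /etc/opt/SUNWsamfs/archiver.cmd - illustrative sketch only
fs = samfs1
logfile = /var/opt/SUNWsamfs/archiver.log

# Archive everything under the mount point: copy 1 to tape after
# 10 minutes, copy 2 after 4 hours.
arset1 .
    1 10m
    2 4h

vsns
arset1.1 li .*
arset1.2 li .*
endvsns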

Unfortunately, SAM works at the file level so it will not help our databases run faster. What will help us is ZFS. ZFS is still making some fairly big waves in the storage community with its Hybrid Storage Pool feature. In a standard configuration, ZFS uses RAM for a Layer 1 read cache (ARC). In advanced configurations, the zpool can be configured to use a Layer 2 cache (L2ARC) on faster disks (e.g., SSD vs. SAS vs. SATA). The zpool can also be configured to use separate, possibly faster disks for the ZFS Intent Log (ZIL), which is basically a write cache (without getting into why it is more than a write cache). Even without faster disks, the ability to store the read/write cache on a separate device can increase performance just by dedicating more IOPS to the cause.
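
Setting this up takes one command per device. A minimal sketch, assuming a pool named tank and made-up device names; substitute your own:

# Main pool on 7.2K RPM SATA disks:
zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0

# A mirrored pair of SSDs as a separate log device (ZIL):
zpool add tank log mirror c2t0d0 c2t1d0

# Another SSD as the L2ARC read cache:
zpool add tank cache c2t2d0

# "logs" and "cache" show up as their own sections:
zpool status tank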

Oracle/Sun’s 7000 series storage builds on the success of the ZFS Hybrid Storage Pool, using Logzilla devices for the ZIL and Readzilla devices for the L2ARC. With the powerful flash acceleration in the storage pool, even 7.2K RPM disks can give performance equal to that of higher speed 15K RPM disks.

Although ZFS does great things for performance by utilizing multiple tiers of storage devices, all the data is still physically stored on the same tier of storage in addition to having the hot data stored again in the caches. This is arguably a waste of capacity but can also lead to performance issues in some cases. For example, a cold L2ARC cache after reboot could give slower performance until fully warmed up. Oracle will probably fix this at some point by allowing the L2ARC to persist if stored on a non-volatile device (bug_id=6662467).
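
If you want to watch the warm-up yourself, the L2ARC counters are exposed through kstat on Solaris (exact statistic names can vary between releases):

# l2_size grows as blocks are fed in; l2_hits climbs as the cache warms up:
kstat -p zfs:0:arcstats | grep l2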

In the meantime, EMC recently announced an interesting new feature called FAST, short for Fully Automated Storage Tiering. FAST is available from FLARE version 04.30.000.5.004. FAST allows you to define a pool in the array composed of multiple RAID Groups, and then define a LUN on the pool as opposed to defining a LUN on the RAID Groups themselves. Once the LUN begins filling with data, the array transparently migrates data between the tiers of the pool in 1GB chunks, storing the hottest data on the fastest tier and the coldest data on the slowest tier.
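
From the CLI, the workflow looks roughly like the sketch below. I am quoting the naviseccli syntax from memory and the names are hypothetical, so check the CLI reference for your FLARE release:

# Create a pool spanning disks from different tiers:
naviseccli -h spa_address storagepool -create -name Pool0 \
  -disks 0_0_0 0_0_1 0_0_2 1_0_0 1_0_1 -rtype r_5

# Bind the LUN on the pool instead of on a RAID Group:
naviseccli -h spa_address lun -create -poolName Pool0 \
  -capacity 500 -sq gb -l 42 -name oradata_42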

FAST sounds like a dream come true. No more complicated storage configurations for the database. No more packages and processes to move historical data to slower disk groups. On the other hand, I am skeptical as to whether or not this technology is really mature. Do all EMC products treat the FAST LUNs the same as traditional LUNs (SnapView, Replication Manager, etc.)? Also, are the ramifications of disk failures the same for a FAST LUN, or does failure of a Tier 1 disk in a FAST pool mean a lot more high performance eggs in one basket? Time will tell.

SUNOS-8000-1L Errors caused by nxge driver for X4447A-z

I recently installed Solaris 10 8/07 on a T2000 with a Sun Quad GbE x8 PCIe Low Profile Adapter (X4447A-z) inside. The machine gave me a lot of problems.

One of the issues was the following message which the machine logged hundreds if not thousands of times:

Oct 23 22:18:27 hostname fmd: [ID 441519 daemon.error] SUNW-MSG-ID: SUNOS-8000-1L,
TYPE: Defect, VER: 1, SEVERITY: Minor
Oct 23 22:18:27 hostname EVENT-TIME: Tue Oct 23 22:18:27 BST 2007
Oct 23 22:18:27 hostname PLATFORM: SUNW,Sun-Fire-T200, CSN: -, HOSTNAME: hostname
Oct 23 22:18:27 hostname SOURCE: eft, REV: 1.16
Oct 23 22:18:27 hostname EVENT-ID: 86cc16cc-a356-6a94-a11b-bbc8cd5e456f
Oct 23 22:18:27 hostname DESC: The EFT Diagnosis Engine encountered telemetry
for which it is unable to produce a diagnosis. Refer to
http://sun.com/msg/SUNOS-8000-1L for more information.
Oct 23 22:18:27 hostname AUTO-RESPONSE: Error reports from the component will be
logged for examination by Sun.
Oct 23 22:18:27 hostname IMPACT: Automated diagnosis and response for these
events will not occur.
Oct 23 22:18:27 hostname REC-ACTION: Run pkgchk -n SUNWfmd to ensure that
fault management software is installed properly. Contact Sun for support.
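
For what it’s worth, fmdump can pull up the details behind these messages. The EVENT-ID feeds straight back into it:

fmdump -v -u 86cc16cc-a356-6a94-a11b-bbc8cd5e456f
fmdump -eV | more
pkgchk -n SUNWfmd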

I originally assumed that these very descriptive messages were part of the same problem with the fmd service which I mentioned in a previous post, but Sun found another source for the problem. Apparently it is the nxge driver.
As I write this entry, Sun is working on a new driver. They tried a test version on my server and it did not solve the problem, but it does seem to lessen the number of errors and add some information to the logs. Specifically, the entries above are sometimes preceded by a line similar to this:

nxge: [ID 752849 kern.warning] WARNING: nxge2 : nxge_ipp_err_evnts: pkt_dis_max
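
If you end up going back and forth with test drivers like I did, modinfo shows which nxge revision is actually loaded, and a quick grep gives a feel for how often the errors fire:

modinfo | grep nxge
grep -c pkt_dis_max /var/adm/messages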

In the meantime, it seems that I will be ditching the quad cards until Sun can get their act together. I’m getting them replaced by two dual-port gigabit cards which use the e1000g driver.