Tag: Data management

Couchbase is Simply Awesome


Here are five things that make Couchbase a go-to service in any architecture.

Couchbase is simple to set up.

Keep It Simple. It’s one of the axioms of system administration. Couchbase, though complicated under the hood, makes it very simple to set up even complicated clusters spanning multiple data centers.

Every node comes with a very user-friendly web interface, including the ability to monitor performance across all the nodes in that node’s cluster.

Adding nodes to a cluster is as simple as plugging in the address of the new node, after which all the data in the cluster is automatically rebalanced between the nodes. The same is true when removing nodes.

Couchbase is built to never require downtime, which makes it a pleasure to work with.

If you are into automation à la Chef, etc., Couchbase supports configuration via a REST API. There are cookbooks available. I’m not sure about other configuration management tools but they probably have the relevant code bits as well.
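To give a feel for what REST-driven configuration looks like, here is a minimal Python sketch that builds (but does not send) the HTTP request for adding a node to a cluster. The `/controller/addNode` path and port 8091 are what I remember from the Couchbase REST documentation, so check them against your version. The host names, the credentials, and the helper name `build_add_node_request` are made up for illustration.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_add_node_request(cluster_host, new_node, admin_user, admin_password, port=8091):
    """Build, without sending, a POST asking an existing cluster node to add new_node."""
    # Form-encoded body: the address of the node to add plus its admin credentials.
    body = urlencode({
        "hostname": new_node,
        "user": admin_user,
        "password": admin_password,
    }).encode("ascii")
    # HTTP basic auth against the existing cluster (assumed to use the same credentials).
    token = base64.b64encode(f"{admin_user}:{admin_password}".encode("utf-8")).decode("ascii")
    request = Request(
        f"http://{cluster_host}:{port}/controller/addNode",
        data=body,
        method="POST",
    )
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical hosts and credentials, for illustration only.
request = build_add_node_request("cb1.example.com", "cb2.example.com", "Administrator", "secret")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send it. The sketch stops short of that so it can be read and tried without a running cluster.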

Couchbase replaces Memcached

Even if you have no need for a more advanced NoSQL solution, there is a good chance you are using Memcached. Couchbase is the original Memcached on steroids.

Unlike traditional Memcached, Couchbase supports clustering, replication, and persistence of data. Using the Moxi Memcached proxy that comes with Couchbase, your apps can talk Memcached protocol to a cluster of Couchbase servers and get the benefits of automatic sharding and failover. If you want, Couchbase can also persist the Memcached data to disk, turning your Memcached into a persistent, highly available key value store.
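"Talking Memcached protocol" means your app sends the same plain-text commands it would send to a stock Memcached server. The Python sketch below encodes a `set` and a `get` in the Memcached text protocol and parses a `get` reply. The function names are my own, and nothing here is Couchbase-specific, which is rather the point.

```python
def encode_set(key, value, flags=0, exptime=0):
    """Encode a Memcached text-protocol 'set' command for a string value."""
    data = value.encode("utf-8")
    header = f"set {key} {flags} {exptime} {len(data)}\r\n".encode("ascii")
    return header + data + b"\r\n"


def encode_get(key):
    """Encode a Memcached text-protocol 'get' command for a single key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(raw):
    """Parse a single-key 'get' reply; return the value, or None on a cache miss."""
    if raw == b"END\r\n":
        return None
    header, rest = raw.split(b"\r\n", 1)
    parts = header.split()
    if len(parts) < 4 or parts[0] != b"VALUE":
        raise ValueError("unexpected response: %r" % header)
    length = int(parts[3])  # VALUE <key> <flags> <bytes>
    return rest[:length].decode("utf-8")


print(encode_set("greeting", "hello"))
print(parse_get_response(b"VALUE greeting 0 5\r\nhello\r\nEND\r\n"))
```

To try it against a real cluster, you would write these bytes to a TCP socket connected to Moxi. I am assuming Moxi listens on the standard Memcached port 11211, so check your own setup.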

Couchbase is also a schema-less NoSQL DB

Aside from support for simple Memcached key/value storage, Couchbase is a highly available, easy-to-scale, JSON-based DB with auto-sharding and built-in map reduce.

Traditionally, Couchbase uses a system called views to perform complicated queries on the JSON data, but they are also working on a new query language called N1QL, which brings tremendous additional ad hoc query capabilities.

Couchbase also supports connectivity to Elasticsearch, Hadoop, and Talend.
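If you have never seen a view, the idea is plain map reduce: a map function emits key/value pairs for the documents it cares about, and a reduce function collapses the values for each key. Real Couchbase views are written in JavaScript and run on the server. The Python below is only a conceptual emulation on made-up documents, not Couchbase’s API.

```python
from collections import defaultdict

# Made-up JSON documents, as they might sit in a bucket.
docs = [
    {"type": "order", "country": "US", "total": 30},
    {"type": "order", "country": "DE", "total": 12},
    {"type": "order", "country": "US", "total": 8},
    {"type": "user", "name": "alice"},
]


def map_order_totals(doc):
    """Map step: emit (country, total) for order documents and skip everything else."""
    if doc.get("type") == "order":
        yield doc["country"], doc["total"]


def run_view(documents, map_fn, reduce_fn):
    """Apply map_fn to every document, group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}


totals_by_country = run_view(docs, map_order_totals, sum)
print(totals_by_country)  # {'DE': 12, 'US': 38}
```

Note that the schema-less part shows up in the map function: the `user` document has no `country` or `total` field, and the view simply ignores it.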

Couchbase is all about global scale out

Adding and removing nodes is simple, and every node in a Couchbase cluster is read and write capable all the time. If you need more performance, you just add more nodes.

When one data center isn’t enough, Couchbase has a feature called cross data center replication (XDCR), letting you easily set up unidirectional or bidirectional replication between multiple Couchbase clusters over WAN. You can even set up full mesh replication, though it isn’t clearly described in their documentation.

Unlike MongoDB, which allows only one master per replica set, Couchbase with XDCR lets apps in any data center write to their local Couchbase cluster, and that data is then replicated to all the other data centers.

I recently set up a system using five Couchbase clusters across the US and Europe, all connected in a full mesh with each other. In my experience, data written in any of the data centers updated across the globe in 1-2 seconds max.
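Since the documentation is thin on full mesh, it helps to count what you are signing up for. Treating each XDCR replication as unidirectional, a full mesh needs one replication from every cluster to every other cluster, which is n × (n - 1) in total. The Python sketch below enumerates them. The cluster names are invented, and the number of buckets you replicate would multiply the count further.

```python
from itertools import permutations


def full_mesh_replications(clusters):
    """Return every (source, target) pair needed for a full mesh of one-way replications."""
    return list(permutations(clusters, 2))


# Hypothetical cluster names for a five-cluster US/Europe deployment.
clusters = ["us-east", "us-central", "us-west", "eu-west", "eu-central"]
links = full_mesh_replications(clusters)

print(len(links))  # 20 one-way replications, i.e. 5 * 4
for source, target in links[:4]:
    print(f"{source} -> {target}")
```

A list like this is also a handy checklist when you configure the replications by hand and want to be sure no direction was missed.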

Couchbase is only getting better

Having used Couchbase built from source (read: community support only) since version 2.1 (Couchbase is now at 3.0.2), I can say that it is only getting better. They have made amazing progress with XDCR, security functionality, and the N1QL language.

The Couchbase community is great. Check out the IRC channel if you need help.

EMC Fully Automated Storage Tiering

Storage Tiering is nothing new. We use fast 15K RPM disks for high performance applications, slower 10K RPM disks for less demanding applications, and 7.2K RPM SATA disks for archive storage. Recently, solid state disks (SSDs) have also become more common for really high performance needs. The trick is managing it all.

Two or three years ago, if you wanted to implement automatic storage tiering, I would have pointed you in the direction of Sun’s Storage and Archive Manager (SAM) and QFS, Sun’s tightly integrated shared file system. SAM-QFS automatically moves files from one storage tier to another based on the SAM policy and transparently retrieves the files when requested. With tape still the least expensive storage available, this is still a great solution for archiving petabytes of documents/files.

Unfortunately, SAM works at the file level so it will not help our databases run faster. What will help us is ZFS. ZFS is still making some fairly big waves in the storage community with its Hybrid Storage Pool feature. In a standard configuration, ZFS uses RAM for a level 1 read cache (ARC). In advanced configurations, the zpool can be configured to use a level 2 cache (L2ARC) on faster disks, i.e. SSDs rather than SAS, or SAS rather than SATA. The zpool can also be configured to use separate, possibly faster disks for the ZFS Intent Log (ZIL), which is basically a write cache (without getting into why it is more than a write cache). Even without faster disks, the ability to store the read/write cache on a separate device can increase performance just by dedicating more IOPS to the cause.
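In practice, turning an existing pool into a hybrid one comes down to two `zpool add` commands, one with the `cache` keyword for the L2ARC and one with the `log` keyword for the ZIL. The Python sketch below only assembles the command lines so they can be reviewed before anyone runs them. The pool name `tank`, the device names, and the helper name are placeholders of mine.

```python
def hybrid_pool_commands(pool, cache_devices=(), log_devices=()):
    """Build the zpool command lines that attach L2ARC (cache) and ZIL (log) devices to a pool."""
    commands = []
    if cache_devices:
        commands.append(["zpool", "add", pool, "cache", *cache_devices])
    if log_devices:
        commands.append(["zpool", "add", pool, "log", *log_devices])
    return commands


# Placeholder pool and device names; nothing is executed here.
for command in hybrid_pool_commands("tank", cache_devices=["c2t0d0"], log_devices=["c2t1d0"]):
    print(" ".join(command))
```

Each list could be handed to `subprocess.run` as root on a machine with ZFS. Printing first is deliberate, since adding devices to a pool is not something you want to do by accident.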

Oracle/Sun’s 7000 series storage builds on the success of the ZFS Hybrid Storage Pool, using Logzilla devices for the ZIL and Readzilla devices for the L2ARC. With the powerful flash acceleration in the storage pool, even 7.2K RPM disks can give performance equal to that of higher speed 15K RPM disks.

Although ZFS does great things for performance by utilizing multiple tiers of storage devices, all the data is still physically stored on the same tier of storage, with the hot data stored a second time in the caches. This is arguably a waste of capacity and can also lead to performance issues in some cases. For example, a cold L2ARC cache after reboot could give slower performance until fully warmed up. Oracle will probably fix this at some point by allowing the L2ARC to persist if stored on a non-volatile device (bug_id=6662467).

In the meantime, EMC recently announced an interesting new feature called FAST, short for Fully Automated Storage Tiering. FAST is available from FLARE version 04.30.000.5.004. FAST allows you to define a pool in the array composed of multiple RAID Groups, and then define a LUN on the pool as opposed to defining a LUN on the RAID Groups themselves. Once the LUN begins filling with data, the array will begin transparently migrating data between the tiers of the pool in 1GB chunks, storing the hottest data on the fastest tier and the coldest data on the slowest tier.

FAST sounds like a dream come true. No more complicated storage configurations for the database. No more packages and processes to move historical data to slower disk groups. On the other hand, I am skeptical as to whether or not this technology is really mature. Do all EMC products (SnapView, Replication Manager, etc.) treat FAST LUNs the same as traditional LUNs? Also, are the ramifications of disk failures for a FAST LUN the same, or does failure of a Tier 1 disk in a FAST pool mean a lot more high performance eggs in one basket? Time will tell.
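To make the idea concrete, here is a toy Python sketch of chunk-level tiering: rank the 1GB chunks by how often they are accessed and fill the tiers from fastest to slowest. This is only my illustration of the concept. EMC’s actual relocation algorithm is not described here, and the tier names, capacities, and access counts are invented.

```python
def place_chunks(access_counts, tier_capacities):
    """Assign chunks to tiers, hottest chunks first, with tiers listed fastest first.

    access_counts maps chunk id -> number of accesses in the sampling window.
    tier_capacities is a list of (tier name, capacity in chunks).
    """
    total_capacity = sum(capacity for _, capacity in tier_capacities)
    if total_capacity < len(access_counts):
        raise ValueError("pool is too small to hold every chunk")
    # Hottest first; ties are broken by chunk id so the result is deterministic.
    ranked = sorted(access_counts, key=lambda chunk: (-access_counts[chunk], chunk))
    placement = {}
    position = 0
    for tier, capacity in tier_capacities:
        for chunk in ranked[position:position + capacity]:
            placement[chunk] = tier
        position += capacity
    return placement


# Invented numbers: five 1GB chunks and a three-tier pool.
counts = {0: 900, 1: 5, 2: 120, 3: 0, 4: 45}
tiers = [("ssd", 1), ("fc_15k", 2), ("sata_7k2", 2)]
print(place_chunks(counts, tiers))
```

With these numbers the busiest chunk lands on SSD, the next two on the 15K disks, and the two coldest on SATA. Rerunning the placement as access patterns change is what would trigger migrations between tiers.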

Real Time Reporting Databases

Reporting projects are the kind of projects which never seem to end. After a couple of iterations I’ve come to the following conclusions:

  1. Absolutely no reports should run on a production database.
  2. Moving/aggregating data from a production database to a reporting database using ETL tools is prone to synchronization issues and pretty unreliable.
  3. The best option is to set up real time replication of the data and build additional views on that.

Unfortunately, if you need to get data from heterogeneous databases, i.e. Oracle, MySQL, SQL Server, etc., into a single reporting database, replication is not a simple solution. If you are running expensive database software in production, it may not be cost effective to run the same database for reporting.

Of course there are cross database replication solutions like GoldenGate or SharePlex but they are very expensive. I had already given up on getting data from Oracle into MySQL for reports when I stumbled across Tungsten Replicator.

According to the website, Tungsten Replicator provides open source database-neutral master/slave replication. Master/slave replication is a highly flexible technology that can solve a wide variety of problems including Cross DBMS Integration, i.e. replication from Oracle to MySQL.

I’m looking forward to testing this product in the near future and I’d be happy to get anyone’s input if they’ve used it.