Tag Archive for Cloud computing

How To Setup a Multi-Platform Website With Cedexis

multiplatform

With the recent and ongoing DDOS attacks against Github, many sites hosted as Github Pages have been scrambling to find alternative hosting. I began writing this tutorial long before the attacks but the configuration you find here is exactly what you need to serve static content from multiple providers like Github Pages, Divshot, Amazon S3, etc.

In my previous article, I introduced the major components of Cedexis and how they fit together to build great multi-provider solutions. If you haven’t read it, I suggest taking a quick look to get familiar with the concepts.

In this article, I’ll show you, step-by-step, how to build a robust multi-provider, multi-platform website using the following Cedexis components:

  • Radar- provides real-time user measurements of provider performance and availability.
  • Sonar- provides lightweight efficient service availability health-checks.
  • OpenMix- DNS traffic routing based on the data from Radar and Sonar.
  • Portal- UI for configuration and reporting.

I’ll also briefly cover some of the reports available in the portal.

Basic Architecture

OpenMix applications all have the same basic architecture and the same basic building blocks. Here is a diagram for reference:

Preparations

For the purpose of the tutorial, I’ve built a demo site using the Amazon S3 and Divshot static web hosting services. I’ve already uploaded my content to both providers and made sure that they are responding.

Both of these services provide us a DNS hostname with which to access the sites.

Amazon S3 is part of the standard community Radar measurements, but as a recently launched service, Divshot hasn’t made the list yet. By adding test objects, we can eventually enable private Radar measurements instead.

Download the test objects from here:

I’ve uploaded them to a cedexis directory inside my web root on Divshot.

Configuring the Platforms

Platforms are, essentially, named data sets. Inside the OpenMix Application, we assign them to a service endpoint and the data they contain influences how Cedexis routes traffic. In addition, the platforms become dimensions by which we slice and dice data in the reports.

We need to define platforms for S3 and Divshot in Cedexis and connect each platform to their relevant data sources (Radar and Sonar).

Log in to the Cedexis portal here and click on the Add Platform button in the Platforms section.

 

 

We’ll find the community platform for Amazon S3 under the Cloud Storage category. It means that S3 performance will be monitored automatically by Cedexis’ Community Radar. You can leave the defaults on this screen:

After clicking next, we’ll get the opportunity to set up custom Radar and Sonar settings for this platform. We want to enable Sonar to make sure there are no problems with our specific S3 bucket which community Radar might not catch.

We’ll enable Sonar polls every 60 seconds (default) and for the test URL, I’ve put the homepage of the site.

We’ll save the platform and create another:

Divshot is somewhere in between Cloud Storage and Cloud Compute. It’s really only hosting static content so I’ve chosen the Cloud Storage category, but there is no real difference from Cedexis’ perspective. If they eventually add Divshot to their community metrics, it might end up in a different category.

Since Divshot isn’t one of the pre-configured Cloud Storage platforms, choose the platform “Other”.

The report name is what will show up in Cedexis charts when you want to analyze the data from this platform.

The OpenMix alias is how OpenMix applications will refer to this platform. Notice that I’ve called it divshot_production. That is because Divshot provides multiple environments for development, staging, and QA. In the future, we may define platforms for other environments as well.

Since there are no community Radar measurements for Divshot, we prepare private measurements of our own in the next step.

We are going to add three types of probes using the test objects which we downloaded above.

First the HTTP Response Time URL (If you are serving your content over SSL, as you should, then you should choose the HTTPS Response Time URL probe, etc.) The HTTP Response Time probe should use the Small Javascript Timing Object.

Click Add probe at the bottom left of the dialog to add the next probes.

In addition to the response time probe, we will add the Cold Start and Throughput probes to cover all our bases.

The Cold Start probe should also use the small test object.

The Throughput probe needs the large test object.

Make sure the test objects are correct in the summary page before setting up Sonar.

Configure the Sonar settings for Divshot similarly to those from S3 with the exception of using the homepage from Divshot for the test URL. Then click ‘Complete’ to save the platform and we are done setting up platforms for now.

A little bit of information about platforms. A nice thing about platforms is that they are decoupled from the OpenMix applications. That means that you can re-use a platform across multiple OpenMix applications with completely different logic. It also means you can slice and dice your data using application and platform as separate dimensions.

For example, if we had applications running in multiple datacenters, we would be interested to know about the performance and availability of each data center across all our applications. Conversely, we would also want to know if a specific application performs better in one data center than another. Cedexis hands us this data on a silver platter.

Our First OpenMix Application

Open the Application Configuration option under the OpenMix tab and click on the plus in the upper right corner to add a new application..

We’re going to select the Optimal RTT quick app for the application type. This app will send traffic to the platform with the best response time according to the information Cedexis has on the user making the request.

Define the fallback host. Note that this does not have to be one of the defined platforms. This host will be used in case the application logic fails or there is a system issue within Cedexis. In this case, I trust S3 slightly more than Divshot so I’ll configure it as the fallback host.

In the second configuration step, I’ve left the default TTL of 20 seconds. This means that users should check every 20 seconds to see if a different provider should be used to return requests. Once Cedexis detects a problem, the maximum time for users to be directed to a different provider should be approximately the same as this value.

In my experience, 20 seconds is a good value to use. It is long enough that users can browse one or two pages of a site before doing any additional DNS lookups and it is short enough to react to shorter downtimes.

Increasing this value will result in fewer requests to Cedexis. To save money, consider automatically changing TTLs via RESTful Web Services. Use lower TTLs during peaks, where performance could be more erratic, and use longer TTLs during low traffic periods to save request costs.

On the third configuration step, I’ve left all the defaults. The Optimal RTT quick app will filter out any platforms which have less than 80% availability before considering them as choices for sending traffic.

Depending on the quality of your services, you may decide to lower this number but you probably do not want it any higher. Why not eliminate any platform that isn’t reporting 100% available? The answer is that RUM measurements rely on the, sometimes poor quality, home networks of your users and, as a result, they can be extremely finicky. Expecting 100% availability from a high traffic service is unrealistic and leaving a threshold of 80% will help reduce thrashing and unwanted use of the fallback host.

Regarding eDNS, you pretty much always want this enabled since many people have switched to using public DNS resolvers like Google DNS instead of the resolvers provided by their ISPs.

Shared resolvers break assumptions made by traffic optimization solutions and the eDNS standard works around this problem, passing information about the request origin to the authoritative DNS servers. Cedexis has supported eDNS from the beginning but many services still don’t.

In the final step, we will configure the service endpoints for each of the platforms we defined.

In our case, we are just associating the hostname aliases that Amazon and Divshot gave us with the correct platform and it’s Radar/Sonar data.

In a more complicated setup, you might have a platform per region of a cloud and service endpoints with different aliases or CNAMEs across each region.

Pay attention that each platform in the application has an “Enabled” checkbox. This makes it easy to go into an application and temporarily stop sending traffic to a specific platform. It is very useful avoiding downtime in case of maintenance windows, migrations, or intermittent problems with a provider.

Choose the Add Platform button on the bottom left corner of the dialog to add the second platform, not the Complete button on the bottom right.

Define the Divshot platform like we did for S3 and click Complete.

You should get a success message with the CNAME for the Cedexis application alias. Click “Publish” to activate the OpenMix application right away.

Alternatively, clicking “Done” would leave the application configured but inactive. When editing applications, you will get a similar dialog. Leaving changes saved but unpublished can be a useful way to stage changes to be activated later with the push of a button.

Building a Custom OpenMix Application

The application we just created will work, but it doesn’t take advantage of the Sonar data that we configured. To consider the Sonar metrics, we will create acustom OpenMix app and by custom, I mean copy and paste the code from Cedexis’ GitHub repository. If you’re squeamish about code, talk to your account manager and I’m sure he’ll be able to help you.

The specific code we are going to adapt can be found here (Note: I’ve fixed the link to a specific revision of the repo to make sure the instructions match, but you might choose to take the latest revision.)

We only need to modify definitions at the beginning of the script:

Let’s create a new application using the new script. Then we can switch back and forth between them if we want. We’ll start by duplicating the original app. Expand the app’s row and click Duplicate.

Switch the application type to that of a Custom Javascript app, modify the name and description to reflect that we will use Sonar data and click next.

Leave the fallback and TTL as is on the next screen.

In the third configuration step, we’ll be asked to upload our custom application.
Choose the edited version of the script and click through to complete the process.

As before, publish the application to activate it.

Adding Radar Support to our Service

At this point, Cedexis is already gathering availability data on both our platforms via Sonar. Since we used the community platform for S3, we also have performance data for that. To finish implementing private Radar for Divshot, we must include the Radar tag in our pages so our users start reporting on their performance.

We get our custom JavaScript tag from the portal (Note: If you want to get really advanced, you can go into the tag configuration section and set some custom parameters to control the behavior of the Radar tag, for example, how many test to run on each page load, etc.)
Copy the tag to your clipboard and add it in your pages on all platforms.

Going Live

Before we go live, we should really test out the application with some manual DNS requests, disabling and enabling platforms, to see that the responses change, etc.

Once we’re satisfied, the last step is to change the DNS records to point at the CNAME of the OpenMix application that we want to use. I’ll set the DNS to point at our Sonar enabled application.

A useful service to check how our applications are working ishttps://www.whatsmydns.net/. This will show how our application CNAMEs resolve from location around the world. For example, if I check the CNAME resolution for the application we just created, I get the following results:

By and large, the Western Hemisphere prefers Divshot while the Eastern Hemisphere prefers Amazon S3 in Europe. This is completely understandable. Interestingly, there are exceptions in both directions. For example, in this test, OpenMix refers the TEANET resolver from Italy to Divshot while the Level 3 resolver in New York is referred to Amazon S3 in Europe. If you refresh the test every so often, you will see that the routings change.

Since this demo site isn’t getting any live traffic, I’ve generated traffic to show off the reports. First the dashboard which gives you a quick overview of your Cedexis traffic on login. Here we show that the majority of the traffic came from North America, and a fair amount came from Europe as well. We also see that, for the most part, the traffic was split evenly between the two platforms.

To examine how Cedexis is routing our traffic, we look at the OpenMix Decision Report. I’ve added a secondary dimension of ‘Platform’ to see how the decisions have been distributed. You see that sometimes Amazon S3 is preferred and other times Divshot.

To figure out why requests were routed one way or the other, we can drill down into the data using the other reports. First, let’s check the availability stats in the Radar Performance Report. For the sake of demonstration, I’ve drilled down into availability per continent. In Asia, we see shoddy availability from Divshot but Amazon S3 isn’t perfect either. Since we didn’t see much traffic from Asia, this probably didn’t affect the traffic distribution. Theoretically, a burst of Asian traffic would result in more traffic going to Amazon.

In Europe, Divshot generally showed better availability than Amazon, reporting 100% except for a brief outage.

In North America, we see a similar graph. As to be expected, the availability of Amazon S3 in Europe is lower and less stable in North America. Divshot shows 100% availability which is also expected.

It’s important to note that the statistics here are skewed because we are comparing our private platform, measured only by our users to the community data from S3. The community platform collects many more data points in comparison to our private platform and it’s almost impossible for it to show 100% availability. This is also why we chose an 80% availability threshold when we built the OpenMix Application.

Next let’s look at the performance reports for response times of each platform. With very little traffic from Asia, the private performance measurements for Divshot are pretty erratic. With more traffic, the graph should stabilize into something more meaningful.

The graph for Europe behaves as expected showing Amazon S3 outperforming Divshot consistently.

The graph for North America also behaves as expected with Divshot consistently outperforming Amazon S3.So we’ve seen some basic data on how our traffic performs. Cedexis doesn’t stop there. We can also take a look at how our traffic could perform if we add a new provider. Let’s see how we could improve performance in North America by adding other community platforms to our graph.

I’ve added Amazon S3 in US-East, which shows almost a 30ms advantage on Divshot, though our private measurement still need to be taken lightly with so little traffic behind them. Even better performance comes from Joyent US-East. Using Joyent will require us to do more server management but if we really care about performance, Cedexis shows that it will provide a major improvement.

Summary

To recap, In this tutorial, I’ve demonstrated how to set up a basic OpenMix application to balance traffic between two providers. Balancing between multiple CDNs or implementing a Hybrid Datacenter+CDN architecture is just as easy.

Cedexis is a great solution for setting up a multi-provider infrastructure. With the ever-growing Radar community generating performance metrics from around the globe, Cedexis provides both real-time situational awareness for your services operational intelligence so you can make well-informed decisions for your business.

Unpacking Cedexis; Creating Multi-CDN and Multi-Cloud applications

package

Technology is incredibly complex and, at the same time, incredibly unreliable. As a result, we build backup measures into the DNA of everything around us. Our laptops switch to battery when the power goes out. Our cellphones switch to cellular data when they lose connection to WiFi.

At the heart of this resilience, technology is constantly choosing between the available providers of a resource with the idea that if one provider becomes unavailable, another provider will take its place to give you what you want.

When the consumers are tightly coupled with the providers, for example, a laptop consuming power at Layer 1 or a server consuming LAN connectivity at Layer 2, the choices are limited and the objective, when choosing a provider, is primarily one of availability.

Multiple providers for more than availability.

As multi-provider solutions make their way up the stack, however, additional data and time to make decisions enable choosing a provider based on other objectives like performance. Routing protocols, such as BGP, operate at Layer 3. They use path selection logic, not only to work around broken WAN connections but also to prefer paths with higher stability and lower latency.

As pervasive and successful as the multi-provider pattern is, many services fail to adopt a full stack multi-provider strategy. Cedexis is an amazing service which has come to change that by making it trivial to bring the power of intelligent, real-time, provider selection your application.

I first implemented Multi-CDN using Cedexis about 2 years ago. It was a no-brainer to go from Multi-CDN to Multi-Cloud. The additional performance, availability, and flexibility for the business became more and more obvious over time. Having a good multi-provider solution is key in cloud-based architectures and so I set out to write up a quick how-to on setting up a Multi-Cloud solution with Cedexis; but first you need to understand a bit about how Cedexis works.

Cedexis Unboxed

Cedexis has a number of components:

  1. Radar
  2. OpenMix
  3. Sonar
  4. Fusion

OpenMix

OpenMix is the brain of Cedexis. It looks like a DNS server to your users but, in fact, it is a multi-provider logic controller. In order to setup multi-provider solutions for our sites, we build OpenMix applications. Cedexis comes with the most common applications pre-built but the possibilities are pretty endless if you want something custom. As long as you can get the data you want into OpenMix, you can make your decisions based on that data in real time.

Radar

Radar is where Cedexis really turned the industry on their heads. Radar uses a javascript tag to crowdsource billions of Real User Monitoring (RUM) metrics in real time. Each time a user visits a page with the Radar tag, they take a small number of random performance measurements and send the data back to Cedexis for processing.

The measurements are non-intrusive. They only happen several seconds after your page has loaded and you can control various aspects of what and how much gets tested by configuring the JS tag in the portal.

It’s important to note that Radar has two types of measurements:

  1. Community
  2. Private.

Community Radar

Community measurements are made against shared endpoints within each service provider. All Cedexis users that implement the Radar tag and allow community measurements get free access to the community Radar statistics. The community includes statistics for the major Cloud compute, Cloud storage, and CDNs making Radar the first place I go to research developments and trends in the Cloud/CDN markets.

Community Radar is the fastest and easiest way to use Cedexis out of the box and the community measurements also have the most volume so they are very accurate all the way up to the “door” of your service provider. They do have some disadvantages though.

The community data doesn’t account for performance changes specific to each of the provider’s tenants. For example, community Radar for Amazon S3 will gather RUM data for accessing a test bucket in the specified region. This data assumes that within the S3 region all the buckets perform equally.

Additionally, there are providers which may opt out of community measurements so you might not have community data for some providers at all. In that case, I suggest you try to connect between your account managers and get them included. Sometimes it is just a question of demand.

Private Radar

Cedexis has the ability to configure custom Radar measurements as well. These measurements will only be taken by your users, the ones using your JS tag.

Private Radar lets you monitor dedicated and other platforms which aren’t included in the community metrics. If you have enough traffic, private Radar measurements have the added bonus of being specific to your user base and of measuring your specific application so the data can be even more accurate than the community data.

The major disadvantage of private Radar is that low volume metrics may not produce the best decisions. With that in mind, you will want to supplement your data with other data sources. I’ll show you how to set that up.

Situational Awareness

More than just a research tool, Radar makes all of these live metrics available for decision-making inside OpenMix. That means we can make much more intelligent choices than we could with less precise technologies like Geo-targeting and Anycast.

Most people using Geo-targeting assume that being geographically close to a network destination is also closer from the networking point of view. In reality, network latency depends on many factors like available bandwidth, number of hops, etc. Anycast can pick a destination with lower latency, but it’s stuck way down in Layer 3 of the stack with no idea about application performance or availability.

With Radar, you get real-time performance comparisons of the providers you use, from your user’s perspectives. You know that people on ISP Alice are having better performance from the East coast DC while people on ISP Bob are having better performance from the Midwest DC even if both these ISPs are serving the same geography.

Sonar

Whether you are using community or low volume private Radar measurements, you ideally want to try and get more application specific data into OpenMix. One way to do this is with Sonar.

Sonar is a synthetic polling tool which will poll any URL you give it from multiple locations and store the results as availability data for your platforms. For the simplest implementation, you need only an address that responds with an OK if everything is working properly.

If you want to get more bang for your buck, you can make that URL an intelligent endpoint so that if your platform is nearing capacity, you can pretend to be unavailable for a short time to throttle traffic away before your location really has issues.

You can also use the Sonar endpoints as a convenient way to automate diverting traffic for maintenance windows- No splash pages required.

Fusion

Fusion is really another amazing piece of work from Cedexis. As you might guess from its name, Fusion is where Cedexis glues services together and it comes in two flavors:

  1. Global Purge
  2. Data Feeds

Global Purge

By nature, one of the most apropos uses of Cedexis is to marry multiple CDN providers for better performance and stability. Every CDN has countries where they are better and countries where they are worse. In addition, maintenance windows in a CDN provider can be devastating for performance even though they usually won’t cause downtime.

The downside of a Multi-CDN approach is the overhead involved in managing each of the CDNs and most often that means purging content from the cache. Fusion allows you to connect to multiple supported CDN providers (a lot of them) and purge content from all of them from one interface inside Cedexis.

While this is a great feature, I have to add that you shouldn’t be using it. Purging content from a CDN is very Y2K and you should be using versioned resources with far futures expiry headers to get the best performance out of your sites and so you never have to purge content from a CDN ever again.

Data Feeds

This is the really great part. Fusion lets you import data from basically anywhere to use in your OpenMix decision making process. Built in, you will find connections to various CDN and monitoring services, but you can also work with Cedexis to setup custom Fusion integrations so the sky’s the limit.

With Radar and Sonar, we have very good data on performance and availability (time and quality) both from the real user perspective and a supplemental synthetic perspective. To really optimize our traffic we need to account for all three corners of the Time, Cost, Quality triangle.

With Fusion, we can introduce cost as a factor in our decisions. Consider a company using multiple CDN providers, each with a minimum monthly commitment of traffic. If we would direct traffic based on performance alone, we might not meet the monthly commitment on one provider but be required to pay for traffic we didn’t actually send. Fusion provides usage statistics for each CDN and allows OpenMix to divert traffic so that we optimize our spending.

Looking Forward

With all the logic we can build into our infrastructure using Cedexis, it could almost be a fire and forget solution. That would, however, be a huge waste. The Internet is always evolving. Providers come and go. Bandwidth changes hands.

Cedexis reports provide operational intelligence on alternative providers without any of the hassle involved in a POC. Just plot the performance of the provider you’re interested in against the performance of your current providers and make an informed decision to further improve your services. When something better come along, you’ll know it.

The Nitty Gritty

Keep an eye out for the next article where I’ll do a step by step walk-through on setting up a Multi-Cloud solution using Cedexis. I’ll cover almost everything mentioned here, including Private and Community Radar, Sonar, Standard and Custom OpenMix Applications, and Cedexis Reports.

Virtual Block Storage Crashed Your Cloud Again :(

darthvader

You know it’s bad when you start writing an incident report with the words “The first 12 hours.” You know you need a stiff drink, possibly a career change, when you follow that up with phrases like “this was going to be a lengthy outage…”, “the next 48 hours…”, and “as much as 3 days”.

That’s what happened to huge companies like NetFlix, Heroku, Reddit,Hootsuite, Foursquare, Quora, and Imgur the week of April 21, 2011. Amazon AWS went down for over 80 hours, leaving them and others up a creek without a paddle. The root cause of this cloud-tastrify echoed loud and clear.

Heroku said:

The biggest problem was our use of EBS drives, AWS’s persistent block storage solution… Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we’ve been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can’t make it work, then probably no one can.

Reddit said:

Amazon had a failure of their EBS system, which is a data storage product they offer, at around 1:15am PDT. This may sound familiar, because it was the same type of failure that took us down a month ago. This time however the failure was more widespread and affected a much larger portion of our servers

While most companies made heartfelt resolutions to get off of EBS, NetFlix was clear to point out that they never trusted EBS to begin with:

When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service…

Fool me once…

As Reddit mentioned in their postmortem, AWS had similar EBS problems twice before on a smaller scale in March. After an additional 80+ hours of downtime, you would expect companies to learn their lesson, but the facts are that these same outages continue to plague clouds using various types of virtual block storage.

In July 2012, AWS experienced a power failure which resulted in a huge number of possibly inconsistent EBS volumes and an overloaded control plane. Some customers experienced almost 24 hours of downtime.

Heroku, under the gun again, said:

Approximately 30% of our EC2 instances, which were responsible for running applications, databases and supporting infrastructure (including some components specific to the Bamboo stack), went offline…
A large number of EBS volumes, which stored data for Heroku Postgres services, went offline and their data was potentially corrupted…
20% of production databases experienced up to 7 hours of downtime. A further 8% experienced an additional 10 hours of downtime (up to 17 hours total). Some Beta and shared databases were offline for a further 6 hours (up to 23 hours total).

AppHarbor had similar problems:

EC2 instances and EBS volumes were unavailable and some EBS volumes became corrupted…
Unfortunately, many instances were restored without associated EBS volumes required for correct operation. When volumes did become available they would often take a long time to properly attach to EC2 instances or refuse to attach altogether. Other EBS volumes became available in a corrupted state and had to be checked for errors before they could be used.
…a software bug prevented this fail-over from happening for a small subset of multi-az RDS instances. Some AppHarbor MySQL databases were located on an RDS instance affected by this problem.

The saga continues for AWS who continued to have problems with EBS later in 2012. They detail ad nauseam, how a small DNS misconfiguration triggered a memory leak which caused a silent cascading failure of all the EBS servers. As usual, the EBS failures impacted API access and RDS services. Yet again Multi-AZ RDS instances didn’t failover automatically.

Who’s using Virtual Block Storage?

Amazon EBS is just one very common example of Virtual Block Storage and by no means, the only one to fail miserably.

Azure stores the block devices for all their compute nodes as Blobs in their premium or standard storage services. Back in November, a bad update to the storage service sent some of their storage endpoints into infinite loops, denying access to many of these virtual hard disks. The bad software was deployed globally and caused more than 10 hours of downtime across 12 data centers. According to the post, some customers were still being affected as much as three days later.

HP Cloud provides virtual block storage based on OpenStack Cinder. See related incident reports here, here, here, here, here. I could keep going back in time, but I think you get the point.

Also based on Cinder, Rackspace offers their Cloud Block Storage product. Their solution has some proprietary component they call Lunr, as detailed in this Hacker News thread so you can hope that Lunr is more reliable than other implementations. Still, Rackspace had major capacity issues spanning over two weeks back in May of last year and I shudder to think what would have happened if anything went wrong while capacity was low.

Storage issues are so common and take so long to recover from in OpenStack deployments, that companies are deploying alternate cloud platforms as a workaround while their OpenStack clouds are down.

What clouds won’t ruin your SLA?

Rackspace doesn’t force you to use their Cloud Block Storage, at least not yet, so unless they are drinking their own kool-aid in ways they shouldn’t be, you are hopefully safe there.

Digital Ocean also relies on local block storage by design. They are apparently considering other options but want to avoid an EBS-like solution for the reasons I’ve mentioned. While their local storage isn’t putting you at risk of a cascading failure, they have been reported to leak your data to other customers if you don’t destroy your machines carefully. They also have other fairly frequent issueswhich take them out of the running for me.

The winning horse

As usual, Joyent shines through on this. For many reasons, the SmartDataCenter platform, behind both their public cloud and open source private cloud solutions, supports only local block storage. For centralized storage, you can use NFS or CIFS if you really need to but you will not find virtual block storage or even SAN support.

Joyent gets some flack for this opinionated architecture, occasionally even from me, but they don’t corrupt my data or crash my servers because some virtual hard disk has gone away or some software upgrade has been foolishly deployed.

With their recently released Docker and Linux Binary support, Joyent is really leading the pack with on-metal performance and availability. I definitely recommend hitching your wagon to their horse.

The Nooooooooooooooo! button

If it’s too late and you’re only finding this article post cloud-tastrify, I refer you to the ever amusing Nooooooooooooooo! button for some comic relief.

Ynet on AWS. Let’s hope we don’t have to test their limits.

tightrope

In Israel, more than in most places, no news is good news. Ynet, one of the largest news sites in Israel, recently posted a case study (at the bottom of this article) on handling large loads by moving their notification services to AWS.

“We used EC2, Elastic Load Balancers, and EBS… Us as an enterprise, we need something stable…”

They are contradicting themselves in my opinion. EBS and Elastic Load Balancers (ELB) are the two AWS services which fail the most and fail hardest with multiple downtimes spanning multiple days each.

EBS: Conceptually flawed, prone to cascading failures

EBS, a virtual block storage service, is conceptually flawed and prone to severe cascading failures. In recent years, Amazon has improved reliability somewhat, mainly by providing such a low level of service on standard EBS, that customers are default to paying extra for reserved IOPS and SSD backed EBS volumes.

Many cloud providers avoid the problematic nature of virtual block storage entirely, preferring compute nodes based on local, direct attached storage.

ELB: Too slow to adapt, silently drops your traffic

In my experience, ELBs are too slow to adapt to spikes in traffic. About a year ago, I was called to investigate availability issues with one of our advertising services. The problems were intermittent and extremely hard to pin down. Luckily, as a B2B service, our partners noticed the problems. Our customers would have happily ignored the blank advertising space.

Suspecting some sort of capacity problem, I ran some synthetic load tests and compared the results with logs on our servers. Multiple iterations of these tests with and without ELB in the path confirmed a gruesome and silent loss of 40% of our requests when traffic via Elastic Load Balancers grew suddenly.

The Elastic Load Balancers gave us no indication that they were dropping requests and, although they would theoretically support the load once Amazon’s algorithms picked up on the new traffic, they just didn’t scale up fast enough. We wasted tons of money in bought media that couldn’t show our ads.

Amazon will prepare your ELBs for more traffic if you give them two weeks notice and they’re in a good mood but who has the luxury of knowing when a spike in traffic will come?

Recommendations

I recommend staying away from EC2, EBS, and ELB if you care about performance and availability. There are better, more reliable providers like Joyent. Rackspace without using their cloud block storage (basically the same as EBS with the same flaws) would be my second choice.

If you must use EC2, try to use load balancing AMIs from companies like Riverbed or F5 instead of ELB.

If you must use ELB, make sure you run synthetic load tests at random intervals and make sure that Amazon isn’t dropping your traffic.

Conclusion

In conclusion, let us hope that we have no reasons to test the limits of Ynet’s new services, and if we do, may it only be good news.

SmartDataCenter, the Open Cloud Platform that Actually Already Works

sdc

For years enterprises have tried to make OpenStack work and failed miserably. Considering how many heads have broken against OpenStack, maybe they should have called it OpenBrick.

Before I dive into the details, I’ll cut to the chase. You don’t have to break your heads on cloud anymore. Joyent have open sourced (as in get it on Github) their cloud management platform.

It’s free if you want (install it on your laptop, install it on a server). It’s supported if you want. Best of all, it actually works outside of a lab or CI test suite. It’s what Joyent runs in production for all their public cloud customers (I admit to being one of the satisfied ones). It’s also something they have been licensing out to other cloud providers for years.

Now for the deep dive.

What’s wrong with OpenStack?

First off, it isn’t a cloud in a box, which is what most people think it is. In 2013,Gartner called out OpenStack for consciously misrepresenting what OpenStack actually provides:

no one in three years stood up to clarify what OpenStack can and cannot do for an enterprise.

In case you’re wondering, the analyst also quoted Ebay’s chief engineer on the true nature of OpenStack:

… an instance of an OpenStack installation does not make a cloud. As an operator you will be dealing with many additional activities not all of which users see. These include infra onboarding, bootstrapping, remediation, config management, patching, packaging, upgrades, high availability, monitoring, metrics, user support, capacity forecasting and management, billing or chargeback, reclamation, security, firewalls, DNS, integration with other internal infrastructure and tools, and on and on and on. These activities are bound to consume a significant amount of time and effort. OpenStack gives some very key ingredients to build a cloud, but it is not cloud in a box.

The analyst made it clear that:

vendors get this difference, trust me.

Other insiders put the situation into similar terms:

OpenStack has some success stories, but dead projects tell no tales. I have seen no less than 100 Million USD spent on bad OpenStack implementations that will return little or have net negative value.Some of that has to be put on the ignorance and arrogance of some of the organizations spending that money, but OpenStack’s core competency, above all else, has been marketing and if not culpable, OpenStack has at least been complicit.

The motive behind the deception is clear. OpenStack is like giving someone a free Ferrari, pink slip and all but keeping the keys. You get pieces of a cloud but no way to run it. Once you have put all your effort into installing OpenStack and you realize what’s missing, you are welcome to turn to any one of the vendors backing OpenStack for one of their packaged cloud platforms.

OpenStack is a foot in the door. It’s a classic bait and switch but even after years, no one is admitting it. Instead, blue chip companies fight to steer OpenStack into the direction that suits them and their corporate offerings.

What’s great about SmartDataCenter?

It works.

The keys are in the ignition. You should probably stop reading this article and install it already. You are likely to get a promotion for listening to me 😉

Great Technology

SmartDataCenter was built on really great technologies like SmartOS (fork of Solaris), Zones, ZFS, and DTrace. Most of these technologies are slowly being ported to Linux but they are already 10 years mature in SDC.

  • Being based on a fork of Solaris brings you baked in enterprise ready features like IPSEC, IPF, RBAC, SMF, Resource management and capping, System auditing, Filesystem monitoring, etc.
  • Zones are the big daddy of container technology guaranteeing you the best on-metal performance for your cloud instances. If you are running a native SmartOS guest, you get the added benefit of CPU bursting and live machine re-size (no reboot, or machine pause necessary).
  • ZFS is the most reliable, high performance, file system in the world and is constantly improving.
  • DTrace is the secret to low level visibility with zero to no overhead. In cloud deployments where visibility is usually close to zero, this is an amazing feature. It’s even more amazing as the cloud operator.

Focus

SDC was built for one thing by one company, to replace the data centers of the past. It says so in the name. With one purpose, SDC has been built to be veryopinionated about what it does and how it does it. This gives SDC a tremendous amount of focus, something sorely lacking from would-be competition like OpenStack.

Lastly, it works.