Month: January 2015

Wrangling Elephants in the Cloud


You know the elephant in the room, the one no one wants to talk about. Well, it turns out there was a whole herd of them hiding in my cloud. There’s a herd of them hiding in your cloud too, I’m sure of it. Here is my story and how I learned to wrangle the elephants in the cloud.

Like many of you, my boss walked into my office about three years ago and said “We need to move everything to the cloud.” At the time, I wasn’t convinced that moving to the cloud had technical merit. The business, on the other hand, had decided that, for whatever reason, it was absolutely necessary.

As I began planning the move (selecting a cloud provider, picking the tools with which to manage the deployment), I knew that I wasn’t going to be able to provide the same quality of service in the cloud as I had in our server farm. There were too many unknowns.

The cloud providers don’t like to give too many details about their setups, nor do they like to provide many meaningful SLAs. I have very little idea what hardware I’m running on. I have almost no idea how it’s connected. How many disks am I running on? What RAID configuration? How many IOPS can I count on? Is a disk failing? Is it being replaced? What will happen if the power supply blows? Do I have redundant network connections?

Whatever it was that made the business decide to move, it trumped all these unknowns. In the beginning, I focused on getting what we had from one place to the other, following whichever tried and true best practices were still relevant.

Since then, I’ve come up with these guiding principles for working around the unknowns in the cloud.

  • Beginners:
    • Develop in the cloud
    • Develop for failure
    • Automate deployment to the cloud
    • Distribute deployments across regions
  • Advanced:
    • Monitor everything
    • Use multiple providers
    • Mix and match private cloud

Wrangling elephants for beginners:

Develop in the cloud.

Developers invariably want to work locally. It’s more comfortable. It’s faster. It’s why you bought them a crazy expensive MacBook Pro. It is also nothing like production, and nothing developed that way ever really works the same in real life.

If you want to run with the IOPS limitations of standard Amazon EBS, or you want to rely on Amazon ELBs to distribute traffic under sudden load, you need to have those limitations in development as well. I’ve seen developers cry when their MongoDB was deployed to EBS, and I’ve seen ELBs drop 40% of the traffic from a huge media campaign.

Develop for failure.

Cloud providers will fail. It is cheaper for them to fail, and in the worst case credit your account for some machine hours, than it is for them to buy high quality hardware and set up highly available networks. In many cases, the failure is not even a complete and total failure (that would be too easy). Instead, it could just be incredibly high response times which your application may not know how to deal with.

You need to develop your application with these possibilities in mind. Chaos Monkey by Netflix is a classic, if not over-achieving example.
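
To make that concrete, here is a minimal sketch of failure-aware code, assuming the Python requests library and a hypothetical internal endpoint. The point is that every call to a cloud dependency gets an explicit timeout and a bounded retry, and the caller has a plan for when the dependency never comes back:

```python
# A sketch only: the endpoint is hypothetical, the policy is up to you.
import time
import requests

def fetch_with_failover(url, retries=3, timeout=2.0):
    """Try a slow or failing dependency a few times, then give up cleanly."""
    for attempt in range(retries):
        try:
            # Never call a cloud service without a timeout; incredibly
            # high response times are a failure mode, not a success.
            return requests.get(url, timeout=timeout)
        except requests.exceptions.RequestException:
            # Back off briefly before retrying the dependency.
            time.sleep(2 ** attempt)
    return None  # The caller must handle the degraded path.

response = fetch_with_failover("http://internal-service.example/health")
if response is None:
    # Serve stale cache, a default, or a friendly error instead of hanging.
    pass
```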

Automate deployment to the cloud.

I’m not even talking about more complicated, possibly over-complicated, auto-scaling solutions. I’m talking about when it’s 3am and your customers are switching over to your competitors. Your cloud provider just lost a rack of machines, including half of your service. You need to redeploy those machines ASAP, possibly to a completely different data center.

If you’ve automated your deployments and there aren’t any other hiccups, it will hopefully take less than 30 minutes to get back up. If not, well, it will take what it takes. There are many other advantages to automating your deployments but this is the one that will let you sleep at night.
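
As a sketch of what that automation might look like, assuming boto3, a pre-baked AMI (the AMI ID below is a placeholder), and valid AWS credentials, redeploying lost capacity to another region becomes a single function call rather than a 3am runbook:

```python
import boto3

def redeploy(region, ami_id, count, instance_type="m3.large"):
    """Launch replacement instances in whichever region is still up."""
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
    )
    return [i["InstanceId"] for i in response["Instances"]]

# Lost a rack in us-east-1? Bring the service back up in us-west-2.
# The AMI ID is a placeholder for your own pre-baked machine image.
instance_ids = redeploy("us-west-2", "ami-12345678", count=8)
```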

Distribute deployments across regions.

A pet peeve of mine is the mess that Amazon has made with their “availability zones.” While the concept is an easy-to-implement solution (from Amazon’s point of view) to the logistical problems of running a cloud service, it is a constantly overlooked source of unreliability for beginners choosing Amazon AWS. Even running a multi-availability zone deployment in Amazon only marginally increases reliability, whereas deploying to multiple regions can be much more beneficial with a similar amount of complexity.

Whether you use Amazon or another provider, it is best to build your service from the ground up to run in multiple regions, even if only in an active/passive capacity. Aside from the standard benefits of a distributed deployment (mitigation of DDOS attacks and uplink provider issues, lower latency to customers, disaster recovery, etc.), running in multiple regions will protect you against regional problems caused by hardware failure, regional maintenance, or human error.
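
As an illustration of the active/passive case, here is a minimal failover sketch, assuming the requests library; the endpoint, record name, and update_dns() helper are all hypothetical stand-ins for your DNS provider’s API (production failover is usually done with DNS-level health checks, but the logic is the same):

```python
import requests

ACTIVE_HEALTH_URL = "https://us-east.example.com/health"  # hypothetical
PASSIVE_REGION = "eu-west"                                # hypothetical

def active_region_healthy(url, timeout=3.0):
    """A health probe with a strict timeout; a slow answer counts as down."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.exceptions.RequestException:
        return False

def update_dns(record, region):
    """Placeholder: call your DNS provider's API to repoint the record."""
    print("Failing %s over to %s" % (record, region))

if not active_region_healthy(ACTIVE_HEALTH_URL):
    update_dns("www.example.com", region=PASSIVE_REGION)
```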

Advanced elephant wrangling:

The four principles before this are really about being prepared for the worst. If you’re prepared for the worst, then you’ve managed 80% of the problem. You may be wasting resources or you may be susceptible to provider level failures, but your services should be up all of the time.

Monitor Everything.

It is very hard to get reliable information about system resource usage in a cloud. It really isn’t in the cloud provider’s interest to give you that information; after all, they are making money by overbooking resources on their hardware. No, you shouldn’t rely on Amazon to monitor your Amazon performance, at least not entirely.

Even when they give you system metrics, it might not be the information you need to solve your problem. I highly recommend reading the book Systems Performance: Enterprise and the Cloud by Brendan Gregg.

Some clouds are better than others at providing system metrics. If you can choose one of those, great! Otherwise, you need to find other strategies for monitoring your systems. You could monitor your services higher up in the stack by adding more metric points to your code. You could audit your request logs. You could install an APM agent.
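
For example, here is a minimal sketch of adding metric points to your code, assuming the statsd Python client and a statsd collector listening locally; the metric names are illustrative:

```python
import time
import statsd

stats = statsd.StatsClient("localhost", 8125)

def handle_request():
    start = time.time()
    # ... real request handling goes here ...
    elapsed_ms = (time.time() - start) * 1000
    stats.timing("api.request_time", elapsed_ms)  # latency as the app sees it
    stats.incr("api.requests")                    # throughput counter
```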

Aside from monitoring your services, you need to monitor your providers. Make sure they are doing their jobs. Trust me, sometimes they aren’t.

I highly recommend monitoring your services from multiple points of view so you can corroborate the data from multiple observers. This happens to fit in well with the next principle.

Use multiple providers.

There is no way around it. Using one provider for any third party service is putting all your eggs in one basket. You should use multiple providers for everything in your critical path, especially the following four:

  • DNS
  • Cloud
  • CDN
  • Monitoring

Regarding DNS, there are some great providers out there. CloudFlare is a great option for the budget conscious. Route53 is not free but not expensive. DNSMadeEasy is a little bit pricier but will give you some more advanced DNS features. Some of the nastiest downtimes in the past year were due to DNS provider failures.

Regarding Cloud, using multiple providers requires very good automation and configuration management. If you can find multiple providers which run the same underlying platform (for example, Joyent licenses out their cloud platform to various other public cloud vendors), then you can save some work. In any case, using multiple cloud providers can save you from some downtime, bad cloud maintenance or worse.

CDNs also have their ups and downs. The Internet is a fluid space and one CDN may be faster one day and slower the next. A good Multi-CDN solution will save you from the bad days, and make every day a little better at the same time.

Monitoring is great, but who’s monitoring the monitor? It’s a classic problem. Instead of trying to make sure every monitoring solution you use is perfect, use multiple providers from multiple points of view (application performance, system monitoring, synthetic polling).

These perspectives all overlap to some degree, backing each other up. If multiple providers start alerting, you know there is a real, actionable problem, and from how they alert, you can sometimes home in on the root cause much more quickly.

If your APM solution starts crying about CPU utilization but your system monitoring solution is silent, you know that you may have a problem that needs to be verified. Is the APM system misreading the situation or has your system monitoring agent failed to warn you of a serious issue?
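
One way to formalize that corroboration, sketched here with illustrative monitor and check names, is to treat an alert seen by two or more independent observers as actionable and a single-source alert as something to verify:

```python
def triage(alerts):
    """alerts: dict mapping monitor name -> set of alerting checks."""
    all_checks = set().union(*alerts.values())
    for check in all_checks:
        witnesses = [m for m, seen in alerts.items() if check in seen]
        if len(witnesses) >= 2:
            # Multiple independent observers agree: act on it.
            print("ACTIONABLE: %s (seen by %s)" % (check, ", ".join(witnesses)))
        else:
            # One observer only: verify the system or the monitor itself.
            print("VERIFY: %s (only %s saw it)" % (check, witnesses[0]))

triage({
    "apm":       {"high_cpu"},
    "system":    set(),
    "synthetic": {"high_cpu", "slow_response"},
})
```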

Mix and match private cloud

Regardless of all the steps above that you can take to mitigate the risks of working in environments not completely under your control, truly business-critical services should remain in-house. You can keep the paradigm of software defined infrastructure by building a private cloud.

Joyent licenses its cloud platform out to companies for building private clouds with enterprise support. This makes mixing and matching between public and private very easy. In addition, they have open sourced the entire cloud platform, so if you want to install it without support, you are free to do so.

Summary

When a herd of elephants is stampeding, there is no hope of stopping them in their tracks. The best you can hope for is to point them in the right direction. Similarly, in the cloud, we will never get back the depth of visibility and control that we have with private deployments. What’s important is to learn how to steer the herd so we are prepared for the occasional stampede while still delivering high quality systems.

Save Money on Private SSL CDN and Improve Performance at the Same Time


SSL support in your site, and therefore, in your CDN is critical but it is also incredibly expensive. In this article, I’ll show you how to save a fortune while improving the performance and security of your site.

It used to be that sites only encrypted the most sensitive traffic with their customers, e.g. registration, login, and checkout. In 2010, the Firesheep extension made it very clear that this is not enough.

Since then, many other attacks on partially encrypted sites have been devised, and so major sites like Google, Twitter, Facebook, and others have all switched to HTTPS by default, aka Always On SSL. Last year, Google called for HTTPS everywhere and later announced that secured sites will get boosts in PageRank.

Shared SSL from the CDN is cheap.

Shared SSL from the CDN is better than nothing, but it isn’t great. It helps with mixed content warnings, but because it isn’t on the same domain as your site, it can cause same-origin policy problems.

Why is Private SSL CDN so expensive?

That’s a great question and I’m not sure there is a good answer. Some of the common excuses are:

  1. Encrypted traffic is heavier than unencrypted traffic, so it should cost more. This is only very slightly true, and recent work in this area has made it even less true.
  2. The logistics of deploying the SSL certificates across all the CDN edge nodes is expensive. It’s honestly hard to believe that companies that base their entire business on managing countless edge nodes have real trouble deploying certificates.
  3. It’s only worth it for the CDN company to assume the security risk of HTTPS traffic if they charge you more money. This is probably the closest answer to the truth.

Regardless of the real reason, CDNs often charge both exorbitant one-time setup fees and high monthly fees for each SSL-enabled configuration.

Common mistakes which are costing you money.

Companies often make several mistakes when ordering SSL services, costing them thousands of extra dollars each month:

  1. Ordering too many domains
  2. Ordering the wrong type of SSL service
  3. Not negotiating the fees

Ordering too many domains

Companies often spread their sites over multiple subdomains, i.e. www.yahu.com, mail.yahu.com, shopping.yahu.com, etc. Each of these sites will have its own content, and many companies understandably set up separate CDN configurations for each origin, i.e. cdn.www.yahu.com, cdn.mail.yahu.com.

Ordering SSL from the CDN for each of those domains will cost a fortune. Instead, create a single CDN configuration for each root domain, i.e. cdn.yahu.com, cdn.giggle.com. In each configuration, use the CDN provider’s built in URL rewriting and origin rewriting features to direct the requests to the appropriate origin.

For example, configure cdn.yahu.com/www/ to cache content from www.yahu.com while cdn.yahu.com/shopping/ caches content from shopping.yahu.com. Now you have drastically cut down the number of SSL slots you need to order.
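
To make the scheme concrete, here is a minimal sketch of the path-to-origin mapping in Python; a real CDN implements this with its own rewrite rules, and the yahu.com names are of course the made-up examples from above:

```python
# Each routing prefix on the consolidated CDN hostname maps to one origin.
ORIGIN_MAP = {
    "/www/":      "www.yahu.com",
    "/shopping/": "shopping.yahu.com",
    "/mail/":     "mail.yahu.com",
}

def origin_for(cdn_path):
    """Map a cdn.yahu.com request path to the origin that serves it."""
    for prefix, origin in ORIGIN_MAP.items():
        if cdn_path.startswith(prefix):
            # Strip the routing prefix before forwarding to the origin.
            return origin, cdn_path[len(prefix) - 1:]
    raise ValueError("No origin configured for %s" % cdn_path)

# cdn.yahu.com/shopping/cart.css is served from shopping.yahu.com/cart.css
print(origin_for("/shopping/cart.css"))  # ('shopping.yahu.com', '/cart.css')
```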

Ordering the wrong type of SSL service

Different CDN providers offer different types of SSL service. Some provide standard single domain certificates. Some use a multi-domain certificate. Some offer wildcard certificates.

Now that you have minimized the number of domains you need to protect in the CDN, you can re-evaluate the type of SSL certificate you need. In the example above, we might have thought to order an expensive wildcard certificate to protect all the subdomains of yahu.com, whereas now we can choose a less expensive single domain certificate.

If we can’t use URL rewriting to save on SSL-enabled CDN domains, it may be less expensive to get one wildcard certificate than to get several domains on a multi-domain certificate.

Not negotiating the fees

CDN fees are always negotiable. Usually, the more traffic you commit to each month, the lower the price you get. SSL slots are also open to volume discounts. If you need multiple SSL slots, try to order them at the same time and ask for a discount on the setup fees. The logic: even if they install each certificate manually, they can install all your certs at the same time.

You should also be able to get a discount on the monthly fees just because you are committing to pay them more each month. A better deal to ask for is to over-commit on your monthly traffic in exchange for a discount on the SSL monthly fees.

For example, say you have a $5K monthly traffic commit and you are adding five SSL slots at a $750/month list price. They offer to go down to $700 monthly per SSL because you are adding five (a total of $8.5K monthly commitment). Counter with an offer to commit to $6K monthly traffic + $500 monthly per SSL. The total is the same $8.5K/month, but this is better because, even if your traffic grows by 20%, you will continue paying $8.5K/month while still getting the SSL service.
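
As a quick sanity check of the arithmetic, both offers land on the same total, but the counter shifts $2.5K/month from SSL fees to traffic commitment you will likely grow into anyway:

```python
list_offer    = 5000 + 5 * 700   # $5K traffic commit + five SSL slots at $700
counter_offer = 6000 + 5 * 500   # $6K traffic commit + five SSL slots at $500
assert list_offer == counter_offer == 8500
```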

How does this improve performance?

There is an added, hidden benefit in consolidating your CDN hostnames.

Now, when a customer first reaches one of your sites, i.e. www.yahu.com, their browser looks up and connects to your consolidated CDN domain – cdn.yahu.com. When they browse your site and switch to a new subdomain, for example if they clicked on shopping.yahu.com, they are already connected to cdn.yahu.com. They save on additional DNS lookups and additional TCP and TLS handshakes (the most expensive part of SSL traffic).

You can push this even further by sharing resources between your origins using a special CDN location, for example by storing common CSS files at cdn.yahu.com/shared/. In this case, shopping.yahu.com will be able to use the CSS files already loaded and cached during the browser’s initial visit to www.yahu.com.

Summary

I’ve used these SSL consolidation techniques with both Akamai and CDNetworks, realizing significant cost savings and performance gains. Although I’ve worked with at least five other CDN providers, I haven’t tried these techniques elsewhere.

If you’re interested in optimizing your CDN deployment for best cost/performance, feel free to contact me via LinkedIn or via https://donatemyfee.org/.

The Ball is in the Net. Goal or No Goal?


The ball hit the net, but from which side? Can you tell? Over the past three years, companies have pushed themselves to the cloud for many reasons, but have they landed on the wrong side of the net?

Many companies have mistaken moving to the cloud for a goal to be achieved, and it is natural to make that mistake. Companies see the bottom line: building services in PaaS or IaaS clouds lowers the cost of bootstrapping risky projects, speeds up time to market, and enables greater flexibility. They naturally make moving everything to the cloud a business target.

They miss that what drives these benefits is the way automation and infrastructure as a service force the modernization and industrialization of a company’s IT teams and processes. Even if a company isn’t using any modern software-driven deployment techniques, it is the industrialization of infrastructure on the provider’s side that allows a “machine” to be spec’ed, purchased, racked, cabled, and installed at the push of a button or the call of an API. It is this change in the way IT works that is improving the bottom line, speeding time to market, and increasing business agility.

Companies that make this distinction realize that hosting your servers in a cloud, private or public, isn’t an end in and of itself. If it’s the automation and software defined infrastructure that is helping the business, then that has to be the focus.

In reality, IaaS is still very immature. There is no provider today whose public IaaS meets the standards of a high quality private deployment, let alone enterprise grade.

Visionary companies like Netflix have built vast frameworks to compensate for some of the problems with the public cloud. In 2013, Netflix’s director of cloud solutions, Ariel Tseitlin, was quoted as saying: “We’re far from being in a commoditized cloud market. It really isn’t a utility like we feel someday it is going to become. If you look at how much infrastructure we built, the huge amount of extra glue and services and tooling we’ve invested in, that gives you an indication of what could be offered in the future.”

Others, like Zynga, went the hybrid cloud route because “While the public cloud is exceptional at providing a wealth of services for various computing needs, we’re an outlier and not a traditional IT workload. The performance and availability required to operate social games on the scale that we do, requires the ability to fine tune infrastructure… We learned to understand our workload, look into the black box of cloud computing, and built what we affectionately call zCloud, our own private cloud infrastructure. zCloud looks, feels, and operates similar to the way we use the public cloud, but allows for greater performance, scale and reliability.”

Between these two notorious cloud consumers, the common denominator is the drive to change the way infrastructure is consumed by the business without sacrificing the quality or reliability of the services. That is why companies should be striving to modernize mainstream IT whether on private, public, or hybrid infrastructure.

Note: Jason Hoffman, founder and former CTO of Joyent, now Head of Cloud Technology at Ericsson, really put this into perspective for me with this video segment (from which I definitely cannibalized some jargon). After living and breathing “cloud” for the past three years, I think he’s really hit the nail on the head, and it will be interesting to see what Ericsson can do to bring forth the next iteration of IaaS.