Tag: Content delivery network

Unpacking Cedexis; Creating Multi-CDN and Multi-Cloud applications


Technology is incredibly complex and, at the same time, incredibly unreliable. As a result, we build backup measures into the DNA of everything around us. Our laptops switch to battery when the power goes out. Our cellphones switch to cellular data when they lose connection to WiFi.

At the heart of this resilience, technology is constantly choosing between the available providers of a resource with the idea that if one provider becomes unavailable, another provider will take its place to give you what you want.

When the consumers are tightly coupled with the providers, for example, a laptop consuming power at Layer 1 or a server consuming LAN connectivity at Layer 2, the choices are limited and the objective, when choosing a provider, is primarily one of availability.

Multiple providers for more than availability

As multi-provider solutions make their way up the stack, however, additional data and time to make decisions enable choosing a provider based on other objectives like performance. Routing protocols, such as BGP, operate at Layer 3. They use path selection logic, not only to work around broken WAN connections but also to prefer paths with higher stability and lower latency.

As pervasive and successful as the multi-provider pattern is, many services fail to adopt a full stack multi-provider strategy. Cedexis is an amazing service which has come to change that by making it trivial to bring the power of intelligent, real-time provider selection to your application.

I first implemented Multi-CDN using Cedexis about 2 years ago. It was a no-brainer to go from Multi-CDN to Multi-Cloud. The additional performance, availability, and flexibility for the business became more and more obvious over time. Having a good multi-provider solution is key in cloud-based architectures and so I set out to write up a quick how-to on setting up a Multi-Cloud solution with Cedexis; but first you need to understand a bit about how Cedexis works.

Cedexis Unboxed

Cedexis has a number of components:

  1. Radar
  2. OpenMix
  3. Sonar
  4. Fusion

OpenMix

OpenMix is the brain of Cedexis. It looks like a DNS server to your users but, in fact, it is a multi-provider logic controller. In order to set up multi-provider solutions for our sites, we build OpenMix applications. Cedexis comes with the most common applications pre-built, but the possibilities are pretty endless if you want something custom. As long as you can get the data you want into OpenMix, you can make your decisions based on that data in real time.
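
To give you a feel for what a custom application looks like, here is a rough sketch of one in the JavaScript style OpenMix apps are written in. The handler names, probe keys, platform aliases, and CNAMEs below follow the pattern of Cedexis’ published examples, but treat all of them as illustrative rather than something to copy verbatim.

// Illustrative OpenMix-style app: answer each DNS request with whichever of two
// platforms currently shows the lowest measured HTTP round-trip time.
// Aliases and CNAMEs are placeholders.
function init(config) {
    config.requireProvider('cdn_a');
    config.requireProvider('cdn_b');
}

function onRequest(request, response) {
    var cnames = {
        cdn_a: 'www.example.cdn-a.net',
        cdn_b: 'www.example.cdn-b.net'
    };
    var rtt = request.getProbe('http_rtt'); // Radar data, keyed by platform alias
    var best = 'cdn_a';
    if (rtt.cdn_a && rtt.cdn_b && rtt.cdn_b.http_rtt < rtt.cdn_a.http_rtt) {
        best = 'cdn_b';
    }
    response.respond(best, cnames[best]);
    response.setTTL(20); // keep the TTL short so decisions stay close to real time
}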

Radar

Radar is where Cedexis really turned the industry on its head. Radar uses a JavaScript tag to crowdsource billions of Real User Monitoring (RUM) metrics in real time. Each time a user visits a page with the Radar tag, they take a small number of random performance measurements and send the data back to Cedexis for processing.

The measurements are non-intrusive. They only happen several seconds after your page has loaded and you can control various aspects of what and how much gets tested by configuring the JS tag in the portal.
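
The tag itself is just a script that you load asynchronously once your own content is out of the way. The snippet below is only a sketch of that pattern; the real script URL and customer-specific IDs come from the Cedexis portal, so the src here is a placeholder.

// Load the Radar tag a few seconds after the page finishes loading so the
// measurements never compete with your own content. Placeholder URL only.
window.addEventListener('load', function () {
    setTimeout(function () {
        var script = document.createElement('script');
        script.async = true;
        script.src = '//YOUR-RADAR-TAG-URL/radar.js'; // taken from the Cedexis portal
        document.body.appendChild(script);
    }, 3000);
});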

It’s important to note that Radar has two types of measurements:

  1. Community
  2. Private

Community Radar

Community measurements are made against shared endpoints within each service provider. All Cedexis users that implement the Radar tag and allow community measurements get free access to the community Radar statistics. The community includes statistics for the major Cloud compute, Cloud storage, and CDN providers, making Radar the first place I go to research developments and trends in the Cloud/CDN markets.

Community Radar is the fastest and easiest way to use Cedexis out of the box and the community measurements also have the most volume so they are very accurate all the way up to the “door” of your service provider. They do have some disadvantages though.

The community data doesn’t account for performance changes specific to each of the provider’s tenants. For example, community Radar for Amazon S3 will gather RUM data for accessing a test bucket in the specified region. This data assumes that within the S3 region all the buckets perform equally.

Additionally, some providers may opt out of community measurements, so you might not have community data for them at all. In that case, I suggest you get your account managers talking and try to get them included. Sometimes it is just a question of demand.

Private Radar

Cedexis has the ability to configure custom Radar measurements as well. These measurements will only be taken by your users, the ones using your JS tag.

Private Radar lets you monitor dedicated and other platforms which aren’t included in the community metrics. If you have enough traffic, private Radar measurements have the added bonus of being specific to your user base and of measuring your specific application so the data can be even more accurate than the community data.

The major disadvantage of private Radar is that low volume metrics may not produce the best decisions. With that in mind, you will want to supplement your data with other data sources. I’ll show you how to set that up.

Situational Awareness

More than just a research tool, Radar makes all of these live metrics available for decision-making inside OpenMix. That means we can make much more intelligent choices than we could with less precise technologies like Geo-targeting and Anycast.

Most people using Geo-targeting assume that being geographically close to a network destination also means being close from a networking point of view. In reality, network latency depends on many factors like available bandwidth, number of hops, etc. Anycast can pick a destination with lower latency, but it’s stuck way down in Layer 3 of the stack with no idea about application performance or availability.

With Radar, you get real-time performance comparisons of the providers you use, from your users’ perspectives. You know that people on ISP Alice are seeing better performance from the East Coast DC while people on ISP Bob are seeing better performance from the Midwest DC, even if both ISPs serve the same geography.

Sonar

Whether you are using community or low-volume private Radar measurements, you ideally want to try to get more application-specific data into OpenMix. One way to do this is with Sonar.

Sonar is a synthetic polling tool which will poll any URL you give it from multiple locations and store the results as availability data for your platforms. For the simplest implementation, you need only an address that responds with an OK if everything is working properly.

If you want to get more bang for your buck, you can make that URL an intelligent endpoint so that if your platform is nearing capacity, you can pretend to be unavailable for a short time to throttle traffic away before your location really has issues.
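
As a sketch of what such an endpoint might look like, here is a tiny Node.js server that answers OK normally and starts reporting itself unavailable once load crosses a threshold. The /sonar path, the 80% threshold, and the load function are my own illustrative choices, not anything Cedexis requires.

// Minimal capacity-aware health endpoint (Node.js, no frameworks).
var http = require('http');

function currentLoadRatio() {
    // Replace with a real signal: connection counts, queue depth, CPU, etc.
    return 0.5;
}

http.createServer(function (req, res) {
    if (req.url !== '/sonar') {
        res.writeHead(404);
        return res.end();
    }
    if (currentLoadRatio() > 0.8) {
        // Pretend to be unavailable so traffic drains before we really have issues.
        res.writeHead(503, { 'Content-Type': 'text/plain' });
        return res.end('BUSY');
    }
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('OK');
}).listen(8080);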

You can also use the Sonar endpoints as a convenient way to automate diverting traffic for maintenance windows. No splash pages required.

Fusion

Fusion is really another amazing piece of work from Cedexis. As you might guess from its name, Fusion is where Cedexis glues services together and it comes in two flavors:

  1. Global Purge
  2. Data Feeds

Global Purge

By nature, one of the most apropos uses of Cedexis is to marry multiple CDN providers for better performance and stability. Every CDN has countries where they are better and countries where they are worse. In addition, maintenance windows in a CDN provider can be devastating for performance even though they usually won’t cause downtime.

The downside of a Multi-CDN approach is the overhead involved in managing each of the CDNs and most often that means purging content from the cache. Fusion allows you to connect to multiple supported CDN providers (a lot of them) and purge content from all of them from one interface inside Cedexis.

While this is a great feature, I have to add that you shouldn’t be using it. Purging content from a CDN is very Y2K; you should be using versioned resources with far-future expiry headers to get the best performance out of your sites, and then you never have to purge content from a CDN ever again.

Data Feeds

This is the really great part. Fusion lets you import data from basically anywhere to use in your OpenMix decision making process. Built in, you will find connections to various CDN and monitoring services, but you can also work with Cedexis to setup custom Fusion integrations so the sky’s the limit.

With Radar and Sonar, we have very good data on performance and availability (time and quality) both from the real user perspective and a supplemental synthetic perspective. To really optimize our traffic we need to account for all three corners of the Time, Cost, Quality triangle.

With Fusion, we can introduce cost as a factor in our decisions. Consider a company using multiple CDN providers, each with a minimum monthly commitment of traffic. If we directed traffic based on performance alone, we might not meet the monthly commitment on one provider and still be required to pay for traffic we didn’t actually send. Fusion provides usage statistics for each CDN and allows OpenMix to divert traffic so that we optimize our spending.
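
The decision logic itself can be as simple as a weighting function. This toy example, with made-up field names, shows the general idea of preferring whichever provider is furthest from its commitment whenever performance is roughly equal; it is not Cedexis code, just an illustration of the reasoning.

// Toy example: bias the choice toward the provider furthest from its monthly
// commit when the performance difference is small. All fields are illustrative.
function chooseProvider(providers) {
    // providers: [{ alias: 'cdn_a', usedGB: 40, commitGB: 100, rtt: 80 }, ...]
    return providers.reduce(function (best, candidate) {
        var closeEnough = Math.abs(candidate.rtt - best.rtt) < 10;
        var bestShortfall = Math.max(0, best.commitGB - best.usedGB);
        var candShortfall = Math.max(0, candidate.commitGB - candidate.usedGB);
        if (closeEnough ? candShortfall > bestShortfall : candidate.rtt < best.rtt) {
            return candidate; // further from commit (or simply faster) wins
        }
        return best;
    }).alias;
}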

Looking Forward

With all the logic we can build into our infrastructure using Cedexis, it could almost be a fire and forget solution. That would, however, be a huge waste. The Internet is always evolving. Providers come and go. Bandwidth changes hands.

Cedexis reports provide operational intelligence on alternative providers without any of the hassle involved in a POC. Just plot the performance of the provider you’re interested in against the performance of your current providers and make an informed decision to further improve your services. When something better comes along, you’ll know it.

The Nitty Gritty

Keep an eye out for the next article where I’ll do a step by step walk-through on setting up a Multi-Cloud solution using Cedexis. I’ll cover almost everything mentioned here, including Private and Community Radar, Sonar, Standard and Custom OpenMix Applications, and Cedexis Reports.

How to Host a Screaming Fast Site for $0.03/Month


I had an idea. That’s always how it starts. Before I know it, I’ve purchased the domain name and I’m futzing around with some HTML but where am I going to host it and how much is this going to end up costing me?

That’s where I was when I came up with #DonateMyFee. “This is a site that is only going to cost me money”, I thought to myself (the whole point is for people to donate money rather than paying me). I really didn’t want to start shelling out big (or small) bucks on hosting.

Long story short, here is the recipe for a screaming fast website on a low budget:

Amazon S3

I’m not a huge fan of Amazon AWS, but S3 is useful enough to make it into my good graces. S3 is Amazon’s storage service. You upload static files into “buckets” and S3 can hold on to them, version them, and most importantly serve them via http. When configured to serve a bucket as a static website, S3 can be used to replace the load balancing and web serving infrastructure needed to serve a static website.
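
Turning on the static website mode is a one-time switch. You can do it from the AWS console, or with a few lines of the AWS SDK for JavaScript like the sketch below; the bucket name, region, and document names are placeholders for whatever your site uses.

// Enable static website hosting on an existing bucket (AWS SDK for JavaScript v2).
var AWS = require('aws-sdk');
var s3 = new AWS.S3({ region: 'eu-west-1' }); // placeholder region

s3.putBucketWebsite({
    Bucket: 'donatemyfee.com', // placeholder bucket name
    WebsiteConfiguration: {
        IndexDocument: { Suffix: 'index.html' },
        ErrorDocument: { Key: 'error.html' }
    }
}, function (err) {
    if (err) { console.error('Failed to enable website hosting:', err); }
});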

There are only two problems with that.

  1. You pay for S3 by the amount of traffic pulled from your bucket.
  2. Your “website” will be called something crazy ugly like donatemyfee.com.s3-website-eu-west-1.amazonaws.com

Regarding the price, S3 tries to get you three ways. They charge for the volume of the data being stored, for the number of requests made, and for the volume of the request throughput in GB. That said, the prices are very reasonable if we can keep the number of requests low. For that reason, a CDN is an absolute must. The CDN will also solve our second problem – the unfriendly S3 website name.

Often S3 is paired with Amazon’s CDN, CloudFront, but I don’t recommend it. CloudFront is expensive as CDNs go and we’re on a budget. Even if we wanted to pay for the CDN, there are better performing options for less. CloudFlare is a great alternative with a free plan that will do us wonders.

CloudFlare

CloudFlare is one of several CDN-by-proxy plus web application firewall solutions that cropped up several years ago. Since the beginning, they have had a free plan and they have proven to be both innovative and competitive.

To use CloudFlare, we need to set their servers as our domain’s DNS name servers, which can be a deal breaker in some cases. Once that’s set up, we create a CNAME record in CloudFlare which points to the ugly S3 website name. CloudFlare has a new CNAME flattening technique which allows us to configure this even for the root domain (without the www). This technique breaks some rules, so I wouldn’t recommend it in every case, but in ours, it’s just what we need.

CloudFlare will cache all of our static content from S3 saving us from paying for the majority of the visits to the site. CloudFlare will also compress and optimize our content so it takes less time to reach the browser. Depending on what kind of traffic your site attracts, CloudFlare’s security settings can also protect you from all kinds of resource abuse, malicious traffic, hotlinking, etc.

Note: S3 will not properly identify the MIME types for every file, which means that some files might not be compressed properly by CloudFlare. You can fix this by changing the metadata for the files in S3. Specifically, .ttf, .eot, and other typography-related files are a problem.
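
One way to fix the metadata is to copy each problem file over itself with the correct Content-Type. Here is a sketch with the AWS SDK for JavaScript (v2); the bucket, key, and chosen MIME type are placeholders you would adjust for your own files.

// Rewrite the Content-Type on an already-uploaded font file in place.
var AWS = require('aws-sdk');
var s3 = new AWS.S3({ region: 'eu-west-1' }); // placeholder region

s3.copyObject({
    Bucket: 'donatemyfee.com',                     // placeholder bucket
    CopySource: 'donatemyfee.com/fonts/icons.ttf', // copy the object onto itself
    Key: 'fonts/icons.ttf',
    ContentType: 'font/ttf',                       // the type CloudFlare should see
    MetadataDirective: 'REPLACE'                   // required so S3 rewrites the metadata
}, function (err) {
    if (err) { console.error('Failed to update metadata:', err); }
});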

Frugal Functionality

Having a cheaply hosted static website is nice, but static can also be pretty useless. In order to get some functionality out of the site, you could go all jQuery on it, but I think that is a road too often traveled these days. I’ve seen too many people include all of jQuery instead of writing 3 lines of JavaScript.

If we want this site to be fast, we need to work frugally. If you take a look at http://donatemyfee.com, you will see some examples of what I call “frugal functionality”.

The social share buttons are static links, not huge JavaScript widgets included from various social networks. Including external scripts is always a bad idea and they always hurt the performance of your site no matter what anyone tells you. Also, the icons and hover animations are CSS typography tricks. No JavaScript and no icon images downloaded.

The site is designed using responsive web design techniques, which is a “buzzword” for using a bunch of crafty CSS to make the same thing look decent on different sized screens. If we were a large company, I would say “Responsive web is for lazy companies and people without a budget to develop good looking, device targeted sites.” Since we’re on a budget, I’ll say it’s frugal 🙂

Last but not least, we have skimped on all the normal infrastructure that goes behind a website so our options for actually generating leads are a bit thin. We could go very old school with mailto links but in these days where webmail reigns supreme, they are getting pretty useless. Enter Google Forms.

Google Forms

If you haven’t been asked to fill out a Google Form yet, here’s your chance. Google lets you create fairly elaborate forms for free. The forms collect the answers and store them automatically in a Google Drive spreadsheet. There are more sophisticated options for processing the answers, and an entire extension ecosystem being built around the process. For us, the basic solution is more than enough.

Note: You can link to the form or embed it in an iframe. The form will take a bite out of your page load performance (iframes are a huge performance no-no). They will also annoy you with endless warnings, all of which you can do nothing about, if you test your site performance with any of the free online services (WebPageTest, Websitetest, GTmetrix, PageSpeed, etc.). In this case, I used some simple (read: jQuery-free) JavaScript to load the embedded iframe only if it’s requested. This has the added benefit of keeping the user on-site to fill out the form and eliminating the page load time performance hit.
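
Roughly, the on-demand embed looks like the snippet below. The element IDs and the form URL are placeholders, not the ones used on the live site, but the pattern is the same: nothing from Google loads until the visitor actually asks for the form.

// Load the Google Form iframe only when the visitor clicks the button.
document.getElementById('show-form').addEventListener('click', function () {
    var holder = document.getElementById('form-holder');
    if (holder.querySelector('iframe')) { return; } // already loaded once
    var frame = document.createElement('iframe');
    frame.src = 'https://docs.google.com/forms/d/e/FORM_ID/viewform?embedded=true';
    frame.setAttribute('width', '100%');
    frame.setAttribute('height', '600');
    frame.setAttribute('frameborder', '0');
    holder.appendChild(frame);
});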

Less is more

Finally, the most important advice about web performance is always “Less is more”. There is no better way to ensure that a page loads quickly than to make it smaller in every way possible. Use less and smaller pictures. Combine, compress and minify everything. Reduce the number of requests.

If you’re interested in getting my help with your site, contact me via LinkedIn or #DonateMyFee. All consulting fees go directly from you to a tax-deductible charity in your/your company’s name.

When 99.999% Isn’t Good Enough

When discussing availability of a service, it is common to hear the term “Five Nines” referring to a service being available 99.999% of the time, but “Five Nines” is relative. If your time frame is a week, then your service can be unavailable for 6.05 seconds, whereas a time frame of a year allows for a very respectable 5.26 minutes.
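
Spelled out, the arithmetic behind those two numbers is simply:

1 week  = 7 x 24 x 60 x 60        =  604,800 seconds
604,800 seconds x (1 - 0.99999)   =  6.05 seconds of allowed downtime

1 year  = 365 x 24 x 60           =  525,600 minutes
525,600 minutes x (1 - 0.99999)   =  5.26 minutes of allowed downtime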

In reality, none of those calculations are relevant because no one cares if a service is unavailable for 10 hours, as long as they aren’t trying to use it. On the other hand, if you’re handling 50,000 transactions per second, 6.05 seconds of unavailability could cost you 302,500 transactions and no one cares if you met your SLA.

This is a problem I’ve come up against a number of times in the past, and even more often recently. The issue is one of orders of magnitude in IT. The larger the volume of business you handle, the less relevant the Five Nines become.

Google became famous years ago for its novel approach to hardware availability. They were using servers and disks on such a scale that they could no longer prevent the failures and they decided not to even try. Instead, they planned to sustain lots of failures and made a business of knowing when to expect problems and where. As much as we would like to be able to take Google’s approach to things, I think most of our IT budgets aren’t up for it.

Another good example is EMC2 who boast 99.999% availability for their Clariion line of storage systems. I want to start by saying that I use EMC storage and I’m happy with them. Regardless, their claim of 99.999% availability doesn’t give me any comfort for the following reasons.

According to a whitepaper from 2007 (maybe they have changed things since then), EMC has a team which calculates availability for every Clariion in the field on a weekly basis. Assuming there were 2000 Clariion systems in the field in a given week (the example given in the whitepaper), and across all of them there was 1.5 hours of downtime, then:

2000 systems x 7 days x 24 hours   =  336,000 total hours of runtime
336,000 hours - 1.5 hours downtime =  335,998.5 hours of uptime
335,998.5 / 336,000                =  99.9996% uptime

That is great, at least that is what EMC wants you to think. I look at this and understand something totally different. According to this guy, as of the beginning of 2009 there were 300,000 Clariions sold, not 2000. That is two orders of magnitude different, meaning:

300,000 systems x 7 days x 24 hours   =  50,400,000 total hours of runtime
50,400,000 hours - 504 hours downtime =  50,399,496 hours of uptime
50,399,496 / 50,400,000               =  99.999% uptime

Granted, that is a lot of uptime but 504 hours of downtime is still 21 full days of downtime for someone. If it were possible for 21 full days of downtime to fit in one week, they could all be yours and EMC would still be able to claim 99.999% availability according to their calculations. By the same token, 3 EMC customers each week could theoretically have no availability the entire week and one of those customers could be me.

Since storage failures can cause so many complications, I figure it is much more likely that EMC downtime comes in days as opposed to minutes or hours. Either way, Five Nines is lost in the scale of things in this case as well.

Content Delivery Networks provide another availability vs. scale problem. Akamai announced record-breaking amounts of traffic on their network in January 2009. They passed 2 terabits and 12,000,000 requests per second. (I don’t use Akamai, but I think it is amazing that they delivered over 2 terabits/second of traffic.) With that level of traffic, even if Akamai provided a 99.999% availability SLA, they could have had 120 failed requests per second, 7,200 failed requests per minute, etc.

Sometimes complaints relating to our CDN cross my desk, and while I have no idea how much traffic our CDN handles worldwide, I know that we can easily send it 20,000,000 requests per day. Assuming 99.999% availability, I expect (learning from Google) to have 200 failed requests per day. Knowing IT as I do, I also expect that all 200 failed requests will be in the same country, probably an issue with one of their cache servers which, due to GTM, will primarily affect people directed to that server, etc. Unfortunately, the issue of scale is lost on our partners who didn’t get their content.

Availability is not the only case where scale is forgotten. I was recently asked to help debug the performance of an application server which could handle a large amount of requests per second when queried directly but only handled 80% of the requests per second when sitting behind a load balancer.

Of course we started by trying to find a reason why the load balancer would be causing a 20% performance hit. After deep investigation, the answer I found (not necessarily the correct answer) was that all the load balancing configurations were correct and, on average, having the load balancer in the path added 1 millisecond to the response time of each request. Unfortunately, the response time without the load balancer was an average of 4 milliseconds, so the additional 1 millisecond reduced the overall performance by 20%.
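
A back-of-the-envelope way to see where the 20% comes from, assuming requests are handled serially so throughput is simply the inverse of the response time:

1000 ms / 4 ms per request  =  250 requests per second (without the load balancer)
1000 ms / 5 ms per request  =  200 requests per second (behind the load balancer)
200 / 250                   =  80% of the original throughput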

In short, everything is relative and 99.999% isn’t good enough.