Tag Archive for load balancing

Ynet on AWS. Let’s hope we don’t have to test their limits.

In Israel, more than in most places, no news is good news. Ynet, one of the largest news sites in Israel, recently posted a case study (at the bottom of this article) on handling large loads by moving their notification services to AWS.

“We used EC2, Elastic Load Balancers, and EBS… Us as an enterprise, we need something stable…”

In my opinion, they are contradicting themselves. EBS and Elastic Load Balancers (ELB) are the two AWS services which fail the most and fail the hardest, with multiple outages each spanning multiple days.

EBS: Conceptually flawed, prone to cascading failures

EBS, a virtual block storage service, is conceptually flawed and prone to severe cascading failures. In recent years, Amazon has improved reliability somewhat, mainly by providing such a low level of service on standard EBS that customers default to paying extra for provisioned IOPS and SSD-backed EBS volumes.

Many cloud providers avoid the problematic nature of virtual block storage entirely, preferring compute nodes based on local, direct attached storage.

ELB: Too slow to adapt, silently drops your traffic

In my experience, ELBs are too slow to adapt to spikes in traffic. About a year ago, I was called to investigate availability issues with one of our advertising services. The problems were intermittent and extremely hard to pin down. Luckily, because ours is a B2B service, our partners noticed the problems. Our customers would have happily ignored the blank advertising space.

Suspecting some sort of capacity problem, I ran some synthetic load tests and compared the results with logs on our servers. Multiple iterations of these tests with and without ELB in the path confirmed a gruesome and silent loss of 40% of our requests when traffic via Elastic Load Balancers grew suddenly.
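A comparison like this can be sketched roughly as follows. It is a minimal illustration, not the original test harness: it assumes every synthetic request carries a unique ID and that the same IDs can be pulled out of the server logs. The function name and the ID format are hypothetical.

```python
def loss_fraction(sent_ids, logged_ids):
    """Return the fraction of sent requests that never showed up in the logs."""
    sent = set(sent_ids)
    if not sent:
        raise ValueError("no requests were sent")
    # Anything the client sent that the servers never logged was lost in between.
    missing = sent - set(logged_ids)
    return len(missing) / len(sent)


# Hypothetical run: 10 requests sent through the load balancer, 6 logged.
sent = [f"req-{i}" for i in range(10)]
logged = [f"req-{i}" for i in range(6)]
print(f"{loss_fraction(sent, logged):.0%} of requests were lost")
```

Running the same check with and without the load balancer in the path is what separates a load balancer problem from a server problem.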

The Elastic Load Balancers gave us no indication that they were dropping requests and, although they would theoretically support the load once Amazon’s algorithms picked up on the new traffic, they just didn’t scale up fast enough. We wasted tons of money in bought media that couldn’t show our ads.

Amazon will prepare your ELBs for more traffic if you give them two weeks’ notice and they’re in a good mood, but who has the luxury of knowing when a spike in traffic will come?

Recommendations

I recommend staying away from EC2, EBS, and ELB if you care about performance and availability. There are better, more reliable providers like Joyent. Rackspace without using their cloud block storage (basically the same as EBS with the same flaws) would be my second choice.

If you must use EC2, try to use load balancing AMIs from companies like Riverbed or F5 instead of ELB.

If you must use ELB, make sure you run synthetic load tests at random intervals and make sure that Amazon isn’t dropping your traffic.

Conclusion

In conclusion, let us hope that we have no reasons to test the limits of Ynet’s new services, and if we do, may it only be good news.

How to Host a Screaming Fast Site for $0.03/Month

I had an idea. That’s always how it starts. Before I know it, I’ve purchased the domain name and I’m futzing around with some HTML but where am I going to host it and how much is this going to end up costing me?

That’s where I was when I came up with #DonateMyFee. “This is a site that is only going to cost me money”, I thought to myself (the whole point is for people to donate money rather than paying me). I really didn’t want to start shelling out big (or small) bucks on hosting.

Long story short, here is the recipe for a screaming fast website on a low budget:

Amazon S3

I’m not a huge fan of Amazon AWS, but S3 is useful enough to make it into my good graces. S3 is Amazon’s storage service. You upload static files into “buckets” and S3 can hold on to them, version them, and most importantly serve them via HTTP. When configured to serve a bucket as a static website, S3 can replace the load balancing and web serving infrastructure normally needed to serve a static website.

There are only two problems with that.

  1. You pay for S3 by the amount of traffic pulled from your bucket.
  2. Your “website” will be called something crazy ugly like donatemyfee.com.s3-website-eu-west-1.amazonaws.com

Regarding the price, S3 tries to get you three ways. They charge for the volume of the data being stored, for the number of requests made, and for the volume of the request throughput in GB. That said, the prices are very reasonable if we can keep the number of requests low. For that reason, a CDN is an absolute must. The CDN will also solve our second problem – the unfriendly S3 website name.
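To see why the CDN matters so much for the bill, here is a rough sketch of the three charges and how a cache hit ratio shrinks two of them. The function, the price keys, and the example prices are all hypothetical placeholders, not Amazon’s actual price list; plug in the current prices for your region.

```python
def monthly_s3_cost(storage_gb, requests, transfer_gb, prices, cache_hit_ratio=0.0):
    """Estimate a monthly S3 bill when a CDN absorbs part of the traffic.

    prices uses hypothetical keys: "per_gb_stored", "per_1000_requests",
    "per_gb_transferred". cache_hit_ratio is the share of visits the CDN
    serves from its cache without touching the bucket.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    origin_share = 1.0 - cache_hit_ratio
    storage = storage_gb * prices["per_gb_stored"]  # unaffected by the CDN
    request = (requests * origin_share / 1000) * prices["per_1000_requests"]
    transfer = (transfer_gb * origin_share) * prices["per_gb_transferred"]
    return storage + request + transfer


# Placeholder prices, chosen only to make the arithmetic easy to follow.
example_prices = {"per_gb_stored": 1.0, "per_1000_requests": 1.0, "per_gb_transferred": 1.0}
print(monthly_s3_cost(2, 1000, 3, example_prices))                       # no CDN
print(monthly_s3_cost(2, 1000, 3, example_prices, cache_hit_ratio=0.5))  # half cached
```

Storage is charged either way; the request and transfer charges are the ones the cache can cut.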

Often S3 is paired with Amazon’s CDN, Cloudfront, but I don’t recommend it. Cloudfront is expensive as CDNs go and we’re on a budget. Even if we wanted to pay for the CDN, there are better performing options for less. CloudFlare is a great alternative with a free plan that will do us wonders.

CloudFlare

CloudFlare is one of several CDN by proxy + Webapp Firewall solutions that cropped up several years ago. Since the beginning, they have had a free plan and they have proven to be both innovative and competitive.

To use CloudFlare, we need to set their servers as our domain’s DNS name servers, which can be a deal breaker in some cases. Once that’s set up, we create a CNAME record in CloudFlare which points to the ugly S3 website name. CloudFlare has a new CNAME flattening technique which allows us to configure this even for the root domain (without the www). This technique breaks some rules, so I wouldn’t recommend it in every case, but in ours it’s just what we need.

CloudFlare will cache all of our static content from S3 saving us from paying for the majority of the visits to the site. CloudFlare will also compress and optimize our content so it takes less time to reach the browser. Depending on what kind of traffic your site attracts, CloudFlare’s security settings can also protect you from all kinds of resource abuse, malicious traffic, hotlinking, etc.

Note: S3 will not properly identify the MIME types for every file, which means that some files might not be compressed properly by CloudFlare. You can fix this by changing the metadata for the files in S3. Specifically, .ttf, .eot, and other typography-related files are a problem.
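One way to get the metadata right is to decide the Content-Type yourself before uploading. The sketch below only picks the type; the helper name and the override table are my own illustration, and the upload step itself is left to whatever tool you use. Check the font types against the list of types your CDN actually compresses.

```python
import mimetypes

# Explicit overrides for font files, using their registered media types.
FONT_TYPES = {
    ".ttf": "font/ttf",
    ".otf": "font/otf",
    ".woff": "font/woff",
    ".woff2": "font/woff2",
    ".eot": "application/vnd.ms-fontobject",
}


def content_type_for(filename):
    """Pick a Content-Type for a file, preferring the font overrides."""
    lower = filename.lower()
    for extension, mime in FONT_TYPES.items():
        if lower.endswith(extension):
            return mime
    # Fall back to the standard library's guess for everything else.
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


for name in ["index.html", "fonts/icons.ttf", "fonts/icons.eot"]:
    print(name, "->", content_type_for(name))
```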

Frugal Functionality

Having a cheaply hosted static website is nice, but static can also be pretty useless. In order to get some functionality out of the site, you could go all jQuery on it, but I think that is a road too often traveled these days. I’ve seen too many people include all of jQuery instead of writing 3 lines of JavaScript.

If we want this site to be fast we need to work frugally. If you take a look at http://donatemyfee.com, you will see some examples of what I call “frugal functionality”.

The social share buttons are static links, not huge JavaScript widgets included from various social networks. Including external scripts is always a bad idea and they always hurt the performance of your site no matter what anyone tells you. Also, the icons and hover animations are CSS typography tricks. No JavaScript and no icon images downloaded.

The site is designed using responsive web design techniques, which is a buzzword for using a bunch of crafty CSS to make the same thing look decent on different sized screens. If we were a large company, I would say “Responsive web is for lazy companies and people without a budget to develop good looking, device targeted sites.” Since we’re on a budget, I’ll say it’s frugal 🙂

Last but not least, we have skimped on all the normal infrastructure that goes behind a website so our options for actually generating leads are a bit thin. We could go very old school with mailto links but in these days where webmail reigns supreme, they are getting pretty useless. Enter Google Forms.

Google Forms

If you haven’t been asked to fill out a Google Form yet, here’s your chance. Google lets you create fairly elaborate forms for free. The forms collect the answers and store them automatically in a Google Drive spreadsheet. There are more sophisticated options for processing the answers, and an entire extension ecosystem being built around the process. For us, the basic solution is more than enough.

Note: You can link to the form or embed it in an iframe. The form will take a bite out of your page load performance (iframes are a huge performance no-no). It will also annoy you with endless warnings, none of which you can do anything about, if you test your site performance with any of the free online services (Webpagetest, Websitetest, GTmetrix, PageSpeed, etc.). In this case, I used some simple (read: jQuery-free) JavaScript to load the embedded iframe only when it’s requested. This has the added benefit of keeping the user on-site to fill out the form while eliminating the page load time performance hit.

Less is more

Finally, the most important advice about web performance is always “Less is more”. There is no better way to ensure that a page loads quickly than to make it smaller in every way possible. Use less and smaller pictures. Combine, compress and minify everything. Reduce the number of requests.

If you’re interested in getting my help with your site, contact me via LinkedIn or #DonateMyFee. All consulting fees go directly from you to a tax deductible charity in your/your company’s name.

Vendor Lock-In or One Stop Shop

I was recently discussing load balancers with someone. I said I was much happier with F5 than I was with Cisco and he countered that although he preferred F5 head to head, going with Cisco for all the network was better for them in the long run.

The situation with storage is similar. EMC makes a great SAN but a pretty bad NAS. Is it worth getting EMC’s NAS for the One Stop Shop factor?

Since Oracle’s acquisition of Sun, I’ve been looking forward to the success of their “One Stop Shop” philosophy. Successfully bringing all their offerings under one roof promises better and faster support all around.

Unfortunately, it has been almost a year and Oracle is still not sure how they are going to unify the customer support systems. New support contracts don’t work in either system. To make things a little less clear, Oracle recently announced that everything will be migrated to “My Oracle Support”, but they don’t know when. Very reassuring.

A simple pattern emerges. One Stop Shop is a dream for IT people. Support is hard enough to get when you’ve isolated a problem to a specific vendor. It is even harder when your problems are between two vendors and each points the finger at the other.

When does the One Stop Shop strategy become a rationalization for Vendor Lock-In? It is a delicate balance between how much better your IT could be with Best of Breed and how much harder it will be to integrate all the different pieces of the puzzle.

Regarding Cisco vs. F5, I’m also pretty happy letting Cisco handle everything Layer 3 and under and I don’t worry too much about the integration. I’m also optimistic regarding Sun and Oracle. I think they’ll have the wrinkles ironed out by the second half of 2011. If they don’t, it will be a serious let down.

When 99.999% Isn’t Good Enough

When discussing the availability of a service, it is common to hear the term “Five Nines”, referring to a service being available 99.999% of the time. But “Five Nines” are relative. If your time frame is a week, your service can be unavailable for 6.05 seconds, whereas a time frame of a year allows for a very respectable 5.26 minutes.
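The arithmetic behind those two figures is simple enough to sketch. The function name is mine, and a year is assumed to be 365 days:

```python
def downtime_budget_seconds(availability_percent, window_seconds):
    """Seconds of allowed unavailability within a window at a given availability."""
    return window_seconds * (100 - availability_percent) / 100


WEEK = 7 * 24 * 3600
YEAR = 365 * 24 * 3600  # assumes a 365-day year

# Five Nines over a week, in seconds (about 6.05).
print(round(downtime_budget_seconds(99.999, WEEK), 2))
# Five Nines over a year, in minutes (about 5.26).
print(round(downtime_budget_seconds(99.999, YEAR) / 60, 2))
```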

In reality, none of those calculations are relevant because no one cares if a service is unavailable for 10 hours, as long as they aren’t trying to use it. On the other hand, if you’re handling 50,000 transactions per second, 6.05 seconds of unavailability could cost you 302,500 transactions and no one cares if you met your SLA.

I’ve come up against this problem a number of times in the past, and even more so recently. The issue is orders of magnitude in IT: the larger the volume of business you handle, the less relevant the Five Nines become.

Google became famous years ago for its novel approach to hardware availability. They were using servers and disks on such a scale that they could no longer prevent the failures and they decided not to even try. Instead, they planned to sustain lots of failures and made a business of knowing when to expect problems and where. As much as we would like to be able to take Google’s approach to things, I think most of our IT budgets aren’t up for it.

Another good example is EMC2 who boast 99.999% availability for their Clariion line of storage systems. I want to start by saying that I use EMC storage and I’m happy with them. Regardless, their claim of 99.999% availability doesn’t give me any comfort for the following reasons.

According to a whitepaper from 2007 (maybe they have changed things since then), EMC has a team which calculates availability for every Clariion in the field on a weekly basis. Assuming there were 2000 Clariion systems in the field on a given week (the example given in the whitepaper), and across all of them there was 1.5 hours of downtime, then:

2000 systems x 7 days x 24 hours   =  336,000 total hours of runtime
336,000 hours - 1.5 hours downtime =  335,998.5 hours of uptime
335,998.5 / 336,000                =  99.9996% uptime

That is great, or at least that is what EMC wants you to think. I look at this and understand something totally different. According to this guy, as of the beginning of 2009 there were 300,000 Clariions sold, not 2000. That is two orders of magnitude more, meaning:

300,000 systems x 7 days x 24 hours   =  50,400,000 total hours of runtime
50,400,000 hours - 504 hours downtime =  50,399,496 hours of uptime
50,399,496 / 50,400,000               =  99.999% uptime

Granted, that is a lot of uptime but 504 hours of downtime is still 21 full days of downtime for someone. If it were possible for 21 full days of downtime to fit in one week, they could all be yours and EMC would still be able to claim 99.999% availability according to their calculations. By the same token, 3 EMC customers each week could theoretically have no availability the entire week and one of those customers could be me.
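The same fleet-wide calculation, written out as a sketch. The function names are mine; the inputs are the figures quoted above.

```python
HOURS_PER_WEEK = 7 * 24


def fleet_availability(systems, downtime_hours, window_hours=HOURS_PER_WEEK):
    """Availability averaged across every system in the field for one window."""
    total_hours = systems * window_hours
    return (total_hours - downtime_hours) / total_hours


def full_window_outages(downtime_hours, window_hours=HOURS_PER_WEEK):
    """How many systems could be down for the entire window on that downtime."""
    return downtime_hours / window_hours


# The 2000-system example from the whitepaper.
print(f"{fleet_availability(2_000, 1.5):.4%}")
# The 300,000-system fleet with 504 hours of downtime.
print(f"{fleet_availability(300_000, 504):.3%}")
# Customers who could be down all week while the fleet still reports Five Nines.
print(full_window_outages(504))
```

Averaging across the fleet is what hides the outliers: the formula cannot tell 504 customers losing an hour each from 3 customers losing the whole week.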

Since storage failures can cause so many complications, I figure it is much more likely that EMC downtime comes in days as opposed to minutes or hours. Either way, Five Nines is lost in the scale of things in this case as well.

Content Delivery Networks provide another availability vs. scale problem. Akamai announced record breaking amounts of traffic on their network in January 2009. They passed 2 terabits per second and 12,000,000 requests per second. (I don’t use Akamai, but I think it is amazing that they delivered over 2 terabits/second of traffic.) With that level of traffic, even if Akamai provided a 99.999% availability SLA, they could have had 120 failed requests per second, 7200 failed requests per minute, etc.

Sometimes complaints relating to our CDN cross my desk, and while I have no idea how much traffic our CDN handles worldwide, I know that we can easily send it 20,000,000 requests per day. Assuming 99.999% availability, I expect (learning from Google) to have 200 failed requests per day. Knowing IT as I do, I also expect that all 200 failed requests will be in the same country, probably because of an issue with one of their cache servers which, due to GTM, will primarily affect people directed to that server. Unfortunately, the issue of scale is lost on our partners who didn’t get their content.
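Translating an availability figure into a failure count is a one-liner. This sketch assumes, as the examples above do, that availability applies evenly per request; the function name is mine.

```python
def expected_failures(requests, availability_percent):
    """Requests expected to fail if availability applies evenly per request."""
    return requests * (100 - availability_percent) / 100


# Akamai's 12,000,000 requests per second at Five Nines.
print(round(expected_failures(12_000_000, 99.999)))
# Our 20,000,000 requests per day at Five Nines.
print(round(expected_failures(20_000_000, 99.999)))
```

In practice the failures are not spread evenly, which is exactly the complaint: they tend to land on one region or one partner.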

Availability is not the only case where scale is forgotten. I was recently asked to help debug the performance of an application server which could handle a large number of requests per second when queried directly but only handled 80% of the requests per second when sitting behind a load balancer.

Of course, we started by trying to find a reason why the load balancer would be causing a 20% performance hit. After deep investigation, the answer I found (not necessarily the correct answer) was that all the load balancing configurations were correct and that, on average, having the load balancer in the path added 1 millisecond to the response time of each request. Unfortunately, the response time without the load balancer was an average of 4 milliseconds, so the additional 1 millisecond reduced the overall performance by 20%.
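The 20% falls straight out of the latencies if you assume the test clients send requests one after another on each connection, so that throughput is inversely proportional to response time. That assumption is mine; a different test setup would behave differently. A minimal sketch:

```python
def relative_throughput(base_latency_ms, added_latency_ms):
    """Throughput with the extra hop, relative to going direct.

    Assumes each client waits for a response before sending the next
    request, so requests per second scale with 1 / response time.
    """
    return base_latency_ms / (base_latency_ms + added_latency_ms)


# 4 ms direct, 1 ms added by the load balancer.
print(f"{relative_throughput(4, 1):.0%}")
# The same 1 ms in front of a 100 ms application is barely noticeable.
print(f"{relative_throughput(100, 1):.1%}")
```

The faster the application, the more a fixed millisecond of overhead costs in relative terms.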

In short, everything is relative and 99.999% isn’t good enough.