Tag Archive for performance

Issues Running Java Programs Remotely via X

Recently I started using SmartSVN running remotely on a Solaris 10 server and displaying on my Windows XP machine via Xming (the free X server). I quickly ran into some performance and usability issues which I believe I have now solved.

  • Performance:
    1. Xming uses OpenGL for rendering by default, so I added the following option to the java command line: -Dsun.java2d.opengl=true
      I’m not sure it made a measurable difference for me, but one developer claimed a 300% performance increase.
    2. Since I am working over a LAN, I disabled compression for the SSH connection being used to tunnel X and set my preferred cipher to Blowfish. Again, I’m not sure this made a difference, but in theory it should be faster (a sketch of the resulting command lines appears after this list).
  • Usability:
    1. There is an apparently known issue with running Java programs via X which causes all sorts of problems in painting the windows correctly. In my case, the mouse position was being shown in one place but the menus were being highlighted about 50 pixels lower. It took me a while to find it in this post about Swing on Remote X and this related post about cursors and menus in NetBeans. Of course, after finding it in Google, I found the following comments in the SmartSVN start script:
      # If you experience problems, e.g. incorrectly painted windows,
      # try to uncomment one of the following two lines
      #export AWT_TOOLKIT=MToolkit
      #export AWT_TOOLKIT=XToolkit

      Uncommenting the export AWT_TOOLKIT=MToolkit did the trick. Now the menus are much more responsive and the cursors show in the correct position.
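
For reference, here is a rough sketch of the command lines I ended up with. The host name and SmartSVN launch path are placeholders for your own environment, and the exact Blowfish cipher name (blowfish-cbc below) depends on your SSH version:

# Tunnel X11 over SSH with compression disabled and a lightweight cipher
# (sensible on a LAN only; the cipher name may differ with your SSH version)
ssh -X -o Compression=no -c blowfish-cbc myuser@solaris10-host

# On the Solaris side: work around the remote-X painting problems and
# ask Java2D to try the OpenGL pipeline (paths and names are placeholders)
export AWT_TOOLKIT=MToolkit
java -Dsun.java2d.opengl=true -jar /path/to/smartsvn.jar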

I’m not sure if the problem here is in XToolkit or in SmartSVN. I’m also not sure what this will mean in future versions of SmartSVN. Apparently JDK7 will not support MToolkit so either XToolkit has to work properly by then or SmartSVN has to be fixed to use XToolkit properly.

Caveats on Using Snapshots for Server-less Backups

Whether you are dealing with disk I/O to read the data from the disks, CPU to compress or encrypt the data (or both, and remember to compress before you encrypt!), or network bandwidth to transfer the data to a backup server, the added load of a backup on your production servers is unwelcome. For this reason, the period of time during which backups can be made, a.k.a. the backup window, may be limited, sometimes severely.

You may say, “It only takes me X hours to do a full backup of everything,” but over time backup windows are notorious for becoming too small: backups get split over multiple days, technologies get upgraded, and so on. When planning a backup strategy, my approach is to eliminate the backup window entirely, that is, to do whatever you can to take the backup load off the production hardware altogether.

Storage snapshots are one method for taking the production servers out of the backup equation. By creating a consistent, point-in-time snapshot on your storage and mounting it on your backup server, you can back up your data using the backup server’s resources while your production servers continue as usual.
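
As a rough sketch of the pattern, here it is with LVM standing in for the storage layer (purely for illustration; with a SAN you would take the snapshot on the array and present the snapshot LUN to the backup host, but the flow is the same, and all names below are made up):

# 1. Create a point-in-time snapshot of the production volume
lvcreate --snapshot --size 10G --name data_snap /dev/vg_prod/data

# 2. On the backup server (or the same host, for plain LVM), mount it read-only
mount -o ro /dev/vg_prod/data_snap /mnt/backup_src

# 3. Run the backup against the snapshot, not the live volume
tar czf /backups/data-$(date +%F).tar.gz -C /mnt/backup_src .

# 4. Clean up once the backup has completed
umount /mnt/backup_src
lvremove -f /dev/vg_prod/data_snap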

Caveats of this method in general are:

  1. Most snapshot technologies are some form of “Copy On Write”. This means that after you take a snapshot, the data in any area of the disks that is written to will first be copied somewhere else for safekeeping and then be overwritten.
    • This may cause a performance hit on your production system, since you are generating extra I/O on every write.
    • As long as the data being used in production has not changed significantly from the snapshot, your backups will still be sending the majority of their read operations to the same physical disks being used by production, so this doesn’t relieve the backup load on the storage as much as it relieves the load on the servers.
  2. The key word is “consistent”.
    • You do not want to be where the KDE developers were when ext4 was released. Depending on the applications or systems you are trying to back up, you may need to quiesce them (FLUSH TABLES WITH READ LOCK, ALTER TABLESPACE <tablespacename> BEGIN BACKUP, etc.); a sketch of this follows the list.
    • If your application, e.g. an Oracle database, uses datafiles or ASM spread over several LUNs, then all of the storage-level snapshots probably need to be taken together in order for the database itself to remain consistent. For more, look up “Consistency Groups.”
  3. Once you have the snapshot, your backup server needs to see the snapshot LUN and be able to mount the filesystem on it. If your backup server doesn’t run the same operating system as your production servers, this may be an issue. E.g. try convincing a Windows server to mount a ZFS pool (I dare you).
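
As an example of the quiescing mentioned above, here is a rough sketch for MySQL data sitting on ZFS (pool, dataset and configuration names are made up; the Oracle equivalent would wrap the snapshot in BEGIN/END BACKUP):

# Hold the global read lock only while the snapshot is created; the lock
# is released automatically when this mysql session ends.
# (Credentials are assumed to come from ~/.my.cnf; names are made up.)
mysql <<'EOF'
FLUSH TABLES WITH READ LOCK;
SYSTEM zfs snapshot tank/mysql@nightly;
UNLOCK TABLES;
EOF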

Anyway, these are just some things to look out for when you want to use storage-level snapshots to back up servers without loading the production systems themselves. In another post I’ll touch on some EMC-specific infrastructure details.

When 99.999% Isn’t Good Enough

When discussing the availability of a service, it is common to hear the term “Five Nines”, referring to a service being available 99.999% of the time, but “Five Nines” is relative. If your time frame is a week, then your service can be unavailable for 6.05 seconds, whereas a time frame of a year allows for a very respectable 5.26 minutes.
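
For reference, the arithmetic behind those two numbers:

1 week = 604,800 seconds     x (1 - 0.99999)  =  6.05 seconds of allowed downtime
1 year = 31,536,000 seconds  x (1 - 0.99999)  =  315.36 seconds = 5.26 minutes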

In reality, none of those calculations are relevant because no one cares if a service is unavailable for 10 hours, as long as they aren’t trying to use it. On the other hand, if you’re handling 50,000 transactions per second, 6.05 seconds of unavailability could cost you 302,500 transactions and no one cares if you met your SLA.

This is a problem I have come up against a number of times in the past, and even more often recently; the underlying issue is orders of magnitude in IT. The larger the volume of business you handle, the less relevant the Five Nines become.

Google became famous years ago for its novel approach to hardware availability. They were using servers and disks on such a scale that they could no longer prevent the failures and they decided not to even try. Instead, they planned to sustain lots of failures and made a business of knowing when to expect problems and where. As much as we would like to be able to take Google’s approach to things, I think most of our IT budgets aren’t up for it.

Another good example is EMC, who boast 99.999% availability for their Clariion line of storage systems. I want to start by saying that I use EMC storage and I’m happy with them. Regardless, their claim of 99.999% availability doesn’t give me any comfort, for the following reasons.

According to a whitepaper from 2007 (maybe they have changed things since then), EMC has a team which calculates availability for every Clariion in the field on a weekly basis. Assuming there were 2000 Clariion systems in the field in a given week (the example given in the whitepaper) and 1.5 hours of downtime across all of them, then:

2000 systems x 7 days x 24 hours   =  336,000 total hours of runtime
336,000 hours - 1.5 hours downtime =  335,998.5 hours of uptime
335,998.5 / 336,000                =  99.9996% uptime

That is great, or at least that is what EMC wants you to think. I look at this and understand something totally different. According to this guy, as of the beginning of 2009 there were 300,000 Clariions sold, not 2000. That is two orders of magnitude more, which means the same 99.999% claim actually allows:

300,000 systems x 7 days x 24 hours    =  50,400,000 total hours of runtime
50,400,000 hours - 504 hours downtime  =  50,399,496 hours of uptime
50,399,496 / 50,400,000                =  99.999% uptime

Granted, that is a lot of uptime, but 504 hours of downtime is still 21 full days of downtime for someone. If it were possible for 21 full days of downtime to fit into one week, they could all be yours and EMC would still be able to claim 99.999% availability according to their calculations. By the same token, 3 EMC customers each week could theoretically have no availability at all for the entire week, and one of those customers could be me.

Since storage failures can cause so many complications, I figure it is much more likely that EMC downtime comes in days rather than minutes or hours. Either way, Five Nines is lost in the scale of things in this case as well.

Content Delivery Networks provide another availability vs. scale problem. Akamai announced record-breaking amounts of traffic on their network in January 2009: they passed 2 terabits per second and 12,000,000 requests per second. (I don’t use Akamai, but I think it is amazing that they delivered over 2 terabits/second of traffic.) At that level of traffic, even if Akamai provided a 99.999% availability SLA, they could still have had 120 failed requests per second, 7,200 failed requests per minute, etc.

Sometimes complaints relating to our CDN cross my desk, and while I have no idea how much traffic our CDN handles worldwide, I know that we can easily send it 20,000,000 requests per day. Assuming 99.999% availability, I expect (learning from Google) to have 200 failed requests per day. Knowing IT as I do, I also expect that all 200 failed requests will be in the same country, probably an issue with one of their cache servers which, due to GTM, will primarily affect the people directed to that server, etc. Unfortunately, the issue of scale is lost on our partners who didn’t get their content.

Availability is not the only case where scale is forgotten. I was recently asked to help debug the performance of an application server which could handle a large number of requests per second when queried directly, but only 80% of that rate when sitting behind a load balancer.

Of course we started by trying to find a reason why the load balancer would be causing a 20% performance hit. After deep investigation, the answer I found (not necessarily the correct answer) was that all the load balancing configurations were correct and that, on average, having the load balancer in the path added 1 millisecond to the response time of each request. Unfortunately, the average response time without the load balancer was 4 milliseconds, so the additional 1 millisecond reduced the overall performance by 20%.
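
Assuming each client thread sends its requests serially (one request must complete before the next is sent), the arithmetic is simple:

1 request / 4 ms   =  250 requests per second per thread (direct)
1 request / 5 ms   =  200 requests per second per thread (behind the load balancer)
(250 - 200) / 250  =  20% drop in throughput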

In short, everything is relative and 99.999% isn’t good enough.

The Systems Architect

What is a Systems Architect?

Systems Architects have years of experience in the various parts of the systems they work with. Most probably, they specialize in a specific area of expertise where they began their careers, but have since expanded their knowledge by learning from their colleagues and from life’s lessons. To get to their position, they have proven their ability to analyze and understand the needs and constraints of the business they work in. They are responsible for deciding what technologies will provide the best solutions for a business, how to integrate them with existing systems, and how to retire them when they are obsolete.

There are several flavors of Systems Architects.

  1. The Bang-For-Your-Buck Architect– They can cut your OPEX by 90% and consolidate all your servers onto two desktops running VMware, if you give them enough time and let them rewrite your application in assembler. They live and get fired by the Project Management Triangle, always aiming for High Quality and Cheap, but Slower than molasses.
  2. The Unbreakable Architect– Their systems do not know the meaning of the word Downtime, but it comes at the price of top-of-the-line co-location, ISP-level network equipment, 2N+1 hardware, months of code freezes, release testing cycles, staging and performance test runs, security scans and rejects.
  3. The Performance Architect– Optimization is the name of the game. Your site will be in the customer’s browser before they can blink (remember: no Flash, no heavy images, no JavaScript, and no calls to Google Analytics). Your database will not have time to read from the disk before it answers your queries (you didn’t really invest in multiple Oracle RAC nodes to access the same data at the same time, did you? Everyone knows that just slows the database down!)
  4. The All-Of-The-Above Architect– Everything is important in the right balance. Consolidation and cutting costs is a factor but so is ease of implementation. The system should be flexible enough to handle new requirements quickly but getting something done quickly is not an excuse for poor implementation. The system must be robust but not every risk is worth mitigating. Cost/Performance is quantitative. It should be measured and improved if necessary in the most cost effective manner. The work of finding balance in a system never ends but the most important thing is to always move towards the end goal.

If you know a Systems Architect like those described above, please have pity on them. They are really trying to do what they think is best, and they really do (probably) have a lot of experience.

Next time- “The Software Architect”