Tag: linux

Making Path Persistent

I’ve been paying a lot of attention to this site since I switched platforms, and somehow people are finding some fairly irrelevant content here when searching for “making path persistent in solaris 10”, so I figured I had better put some real answers up.

It is hard to know exactly what kind of path they had in mind- were they referring to the standard PATH variable, which lists the directories to search for executables, or to something more complicated?

You can make the executable search PATH variable persistent in several ways:

  1. On the system level you can set it in the /etc/profile file. It will affect all users except maybe root.
  2. On a per-user level, or for the user root, you can set the PATH in the .profile file in the user’s home directory (see the sketch below).
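
For example, a minimal sketch of the second option, assuming a Bourne/ksh-style login shell (the extra directories are only illustrations):

# append extra directories to the executable search path in ~/.profile
PATH=$PATH:/usr/sfw/bin:/opt/sfw/bin
export PATH

The same two lines in /etc/profile cover the first option. The change takes effect at the next login, or immediately if you source the file with ". ./.profile".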

Xlib: PuTTY X11 proxy: wrong authentication protocol attempted

While setting up some developers with remote SmartSVN via X over SSH using Plink, I ran into the following error:

Xlib: PuTTY X11 proxy: wrong authentication protocol attempted

SmartSVN couldn’t connect to the tunneled X server display. I was extremely confused since I’d been using X tunneling successfully with SecureCRT. After googling the error message a little, it seems that the part about the “wrong authentication protocol attempted” is misleading. You can get this message for not having the right magic cookie on the client side, or for not having a cookie at all, as was apparently the case for me.

In my case, the developers are authenticated against Active Directory via Samba/Winbind. Their home directories are non-existent until the first time they log in via SSH. When using XForwarding over SSH, the ssh daemon on the server usually handles setting the DISPLAY and the authentication cookies, but in my case it was trying to set up the cookies before the user’s home directory had been created.

With some more digging, I found the $HOME/.ssh/rc and /etc/ssh/sshrc files, which allow you to replace the standard XForwarding setup with your own custom process. Paraphrased from the sshd man page:

The primary purpose of $HOME/.ssh/rc is to run any initialization routines that might be needed before the user’s home directory becomes accessible; AFS is a particular example of such an environment…

If X11 forwarding is in use, it will receive the proto cookie pair in its standard input and DISPLAY in its environment. The script must call xauth because sshd will not run xauth automatically to add X11 cookies…

This file will probably contain some initialization code followed by something similar to:

if read proto cookie && [ -n "$DISPLAY" ]
then
  if [ `echo $DISPLAY | cut -c1-10`  =  'localhost:' ]
  then
    # X11UseLocalhost=yes
    echo add unix:`echo $DISPLAY | cut -c11-` $proto $cookie
  else
    # X11UseLocalhost=no
    echo add $DISPLAY $proto $cookie
  fi | xauth -q -
fi

If this file does not exist, /etc/ssh/sshrc is run, and if that does not exist, xauth is used to store the cookie…

/etc/ssh/sshrc: Similar to $HOME/.ssh/rc. This can be used to specify machine-specific login-time initializations globally.

I pretty much cut and pasted the code from the man page, with two caveats:

  1. I used the full path to the xauth binary in the second to last line.
  2. I added the commands to create the user’s home directory before the call to xauth (see the sketch below).
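
The result looked roughly like the following sketch of /etc/ssh/sshrc. The home directory step is only a placeholder for whatever actually creates the directory in your environment, and the xauth path (/usr/openwin/bin/xauth here) may differ on your system:

# make sure the user's home directory exists before xauth needs it
# (placeholder: a plain mkdir only works if the user can write to the
#  parent directory; substitute your own home directory creation process)
[ -d "$HOME" ] || mkdir -p "$HOME"

if read proto cookie && [ -n "$DISPLAY" ]
then
  if [ `echo $DISPLAY | cut -c1-10` = 'localhost:' ]
  then
    # X11UseLocalhost=yes
    echo add unix:`echo $DISPLAY | cut -c11-` $proto $cookie
  else
    # X11UseLocalhost=no
    echo add $DISPLAY $proto $cookie
  fi | /usr/openwin/bin/xauth -q -
fi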

That done, Plink was able to set up the XForwarding tunnel without a problem. I still can’t explain why Plink failed in the first place while SecureCRT had no problem with the home directories appearing later in the login process.

Additional Reading – X Protocol background:

X is a client-server protocol. The client program connects to a DISPLAY (usually defined in the similarly named environment variable) which represents the server displaying the GUI. Technically the display can refer to an X server on the same machine, on a remote machine on the same LAN, or even a server located across the Internet.

In order for a client to successfully connect to a display, the client needs to be authorized using either host authentication, cookie authentication, or user authentication. Host authentication allows all connections to an X server from one or more hosts/IP addresses. This is extremely insecure and should not be used. User authentication requires the client to authenticate as a user (using Kerberos for example) with authorization to access the X server. The most common authentication used with X servers is cookie authentication, which basically uses a pre-shared key to authenticate clients. If your client knows the key, it gets in. If not, not.

In most cases, i.e. every Linux desktop installation, the X server and client are on the same machine, so both the server and the client can easily look at the cookie in the user’s home directory. In the case of a remote connection (a purely X protocol connection between client and server over the network), the user has to copy the cookie from the server side to the client side using the xauth utilities. Since the advent of SSH and XForwarding, this process has pretty much been put out to pasture. The ssh client and ssh daemon are now mostly responsible for setting up authentication on the tunneled X connection, although in cases like the one above, administrators might have to help things along.
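
For the record, copying a cookie by hand for a purely remote X connection looks roughly like the sketch below, assuming the X server accepts direct TCP connections. The hostname workstation and the hex key are placeholders:

# on the machine running the X server, list the cookie for the display
xauth list "$DISPLAY"
# prints something like:
#   workstation/unix:0  MIT-MAGIC-COOKIE-1  d2a5f1e3c4b6a7980123456789abcdef

# on the client machine, add the same cookie under the display name the
# client will actually use, then point clients at that display
xauth add workstation:0 MIT-MAGIC-COOKIE-1 d2a5f1e3c4b6a7980123456789abcdef
DISPLAY=workstation:0
export DISPLAY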

EMC Replication Manager in Solaris

UPDATE: No ZFS Support for Replication Manager in the near future

Storage-level snapshots can be used to run backups without directly requiring resources from the original host.

EMC Replication Manager coordinates the creation of application-consistent snapshots across all the hosts in your network. It handles scheduling the creation and expiration of snapshots, mounting and unmounting them on backup servers, etc., from a single console.

Although it is not as tightly integrated into EMC Networker as the similar Networker PowerSnap module, it can be used to start a backup process after taking a new snapshot, and it can manage snapshots unrelated to backups from a GUI.

While the data sheet claims support for Solaris, there are several caveats which I have run into.

  1. There is no mention of ZFS support in the data sheet and, apparently, there is no support in the software either. One would expect this to be a non-issue since ZFS has been part of Solaris since 2006.
  2. The data sheet is missing the word “SPARC” next to the word Solaris. There is no support for x86.

Honestly, this has put a dent in my plans since my backup server is an x86 box. I’m hoping the lack of ZFS support will work out as long as we can script any FS-specific magic we need. I don’t have the option of running something like Linux on it (just to get the software working) because then I wouldn’t be able to even mount the ZFS filesystems, let alone back them up.

In the meantime, I’ll have to move my backups to a SPARC server, and considering the lack of low-end SPARC machines, I’ll have to allocate something way too expensive to be a backup server.

When 99.999% Isn’t Good Enough

When discussing the availability of a service, it is common to hear the term “Five Nines”, referring to a service being available 99.999% of the time, but “Five Nines” is relative. If your time frame is a week, then your service can be unavailable for 6.05 seconds, whereas a time frame of a year allows for a very respectable 5.26 minutes.
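
The arithmetic behind those numbers is quick to check; a small awk sketch (the rounding to two places is mine):

# allowed downtime at 99.999% availability for two time frames
awk 'BEGIN {
  u    = 1 - 0.99999          # unavailable fraction
  week = 7 * 24 * 3600        # seconds in a week
  year = 365.25 * 24 * 3600   # seconds in a year
  printf "per week: %.2f seconds\n", week * u
  printf "per year: %.2f minutes\n", year * u / 60
}'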

In reality, none of those calculations are relevant because no one cares if a service is unavailable for 10 hours, as long as they aren’t trying to use it. On the other hand, if you’re handling 50,000 transactions per second, 6.05 seconds of unavailability could cost you 302,500 transactions and no one cares if you met your SLA.

This is a problem I’ve come up against a number of times in the past, and even more often recently; the issue is one of orders of magnitude in IT. The larger the volume of business you handle, the less relevant the Five Nines become.

Google became famous years ago for its novel approach to hardware availability. They were using servers and disks on such a scale that they could no longer prevent the failures and they decided not to even try. Instead, they planned to sustain lots of failures and made a business of knowing when to expect problems and where. As much as we would like to be able to take Google’s approach to things, I think most of our IT budgets aren’t up for it.

Another good example is EMC, who boast 99.999% availability for their Clariion line of storage systems. I want to start by saying that I use EMC storage and I’m happy with them. Regardless, their claim of 99.999% availability doesn’t give me any comfort, for the following reasons.

According to a whitepaper from 2007 (maybe they have changed things since then), EMC has a team which calculates availability for every Clariion in the field on a weekly basis. Assuming there were 2000 Clariion systems in the field in a given week (the example given in the whitepaper), with 1.5 hours of downtime across all of them, then:

2000 systems x 7 days x 24 hours   =  336,000 total hours of runtime
336,000 hours - 1.5 hours downtime =  335,998.5 hours of uptime
335,998.5 / 336,000                =  99.9996% uptime

That is great; at least, that is what EMC wants you to think. I look at this and understand something totally different. According to this guy, as of the beginning of 2009 there were 300,000 Clariions sold, not 2000. That is two orders of magnitude different, meaning that EMC could see 504 hours of downtime across the install base in a week and still claim Five Nines:

300,000 systems x 7 days x 24 hours     =  50,400,000 total hours of runtime
50,400,000 hours - 504 hours downtime   =  50,399,496 hours of uptime
50,399,496 / 50,400,000                 =  99.999% uptime

Granted, that is a lot of uptime but 504 hours of downtime is still 21 full days of downtime for someone. If it were possible for 21 full days of downtime to fit in one week, they could all be yours and EMC would still be able to claim 99.999% availability according to their calculations. By the same token, 3 EMC customers each week could theoretically have no availability the entire week and one of those customers could be me.

Since storage failures can cause so many complications, I figure it is much more likely that EMC downtime comes in days rather than minutes or hours. Either way, Five Nines is lost in the scale of things in this case as well.

Content Delivery Networks provide another availability vs. scale problem. Akamai announced record-breaking amounts of traffic on their network in January 2009. They passed 2 terabits per second and 12,000,000 requests per second. (I don’t use Akamai, but I think it is amazing that they delivered over 2 terabits/second of traffic.) With that level of traffic, even if Akamai provided a 99.999% availability SLA, they could have had 120 failed requests per second, 7200 failed requests per minute, etc.

Sometimes complaints relating to our CDN cross my desk, and while I have no idea how much traffic our CDN handles worldwide, I know that we can easily send it 20,000,000 requests per day. Assuming 99.999% availability, I expect (learning from Google) to have 200 failed requests per day. Knowing IT as I do, I also expect that all 200 failed requests will be in the same country, probably an issue with one of their cache servers which, due to GTM, will primarily affect people directed to that server, etc. Unfortunately, the issue of scale is lost on our partners who didn’t get their content.

Availability is not the only case where scale is forgotten. I was recently asked to help debug the performance of an application server which could handle a large number of requests per second when queried directly but only handled 80% of that rate when sitting behind a load balancer.

Of course we started by trying to find a reason why the load balancer would be causing a 20% performance hit. After deep investigation, the answer I found (not necessarily the correct answer) was that all the load balancing configurations were correct and that, on average, having the load balancer in the path added 1 millisecond to the response time of each request. Unfortunately, the response time without the load balancer averaged 4 milliseconds, so for a client issuing requests one after another, the extra millisecond cut the rate from 250 to 200 requests per second, reducing the overall performance by 20%.
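
A quick awk sketch of that arithmetic, assuming a single client sending requests back to back at the average response times quoted above:

# requests per second at each average response time
awk 'BEGIN {
  direct = 1000 / 4    # 4 ms per request -> 250 requests/second
  via_lb = 1000 / 5    # 5 ms per request -> 200 requests/second
  printf "direct: %d req/s, via LB: %d req/s, drop: %.0f%%\n",
         direct, via_lb, (1 - via_lb / direct) * 100
}'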

In short, everything is relative and 99.999% isn’t good enough.

Sun’s Predicament

I’ve been working with Unix for a fairly long time now- about 13 years.

I’ll admit that I started with Linux and thought it was light years ahead of SunOS 4.x running on those old SPARC machines- I mean, who had heard of SPARC processors? I remember my boss trying to explain to me that even an older SPARC processor was more powerful than a newer Intel Pentium processor. I didn’t really believe him. In time, I convinced them to get rid of most of their SPARC/Solaris systems in favor of the hip, free, and cheap Intel/Linux combination.

Now I see that I couldn’t have been more wrong. I realize that SunOS 4.x probably still has features which I don’t know how to use properly. When I look at Solaris 10, ZFS, Zones, LDOMs, DTrace, etc., I’m not really sure you could pay me to work with Linux (that would be so depressing). That isn’t even mentioning the SPARC hardware it runs on- can any Intel server compare to a T5140???

That’s why the current situation with Sun absolutely SUCKS (pardon my French)! I’m sure there are a lot of admins out there who feel the same way. If this Oracle deal doesn’t go through and Sun disappears because of it, it will be our loss. We’ll be stuck with mediocre operating systems and commodity hardware, and I really hope it doesn’t happen.

That said, I’d like to say thanks to all the people at Sun who are still turning out crazy cool technologies despite the problems.