Month: November 2007

Persistent static routes in Solaris 10 11/06, 08/07

Static routes are a very common necessity once your networks become even a little complex. Whether you need to route specific traffic over a VPN or set up specific test addresses for IPMP failover, static routes are indispensable.

For many years the “correct” way of configuring static routes in Solaris has been to create an init.d script which ran the ‘route add’ commands.

As of Solaris 10 11/06, a more reasonable approach has been implemented. The ‘route’ command has a new option, ‘-p’, which the man page describes as follows:

Make changes to the network route tables persistent across system restarts. The operation is applied to the network routing tables first and, if successful, is then applied to the list of saved routes used at system startup. In determining whether an operation was successful, a failure to add a route that already exists or to delete a route that is not in the routing table is ignored. Particular care should be taken when using host or network names in persistent routes, as network-based name resolution services are not available at the time routes are added at startup.
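A minimal sketch of how the option can be used. The network and gateway addresses are made-up placeholders, and the ROUTE variable is only a convenience added here so the commands can be previewed before they touch the live routing table:

```shell
# Preview by default: ROUTE expands to "echo route", so commands are only printed.
# Set ROUTE=route (as root) to really apply them.
ROUTE=${ROUTE:-echo route}

# Add a persistent route to a hypothetical network via a hypothetical gateway.
# The gateway is an IP address, not a name, because name services are not
# yet available when the saved routes are replayed at startup.
$ROUTE -p add -net 10.20.0.0/16 192.168.1.254

# List the saved routes that will be re-added at boot.
$ROUTE -p show

# Remove the route from the running table and from the saved list.
$ROUTE -p delete -net 10.20.0.0/16 192.168.1.254
```

Using ‘route -p show’ to inspect the saved routes avoids depending on the format of the underlying file.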

Now you may be asking “Where is my configuration file?” The route command currently stores your static routes in the file /etc/inet/static_routes but this has been declared volatile. Sun is not promising to keep these configurations in that file or in the same format from release to release.

I personally am not happy with Sun’s general move to administrative utilities for configuration as opposed to configuration files. I agree that utilities are useful. They ensure correct syntax, etc., but I want the ability to configure a system at the file system level as well. Otherwise I lose the ability to keep a system’s configuration files in version control. I lose the ability to deploy a system by transferring the appropriate files (à la scp, cfengine, puppet, homegrown script, etc.). I prefer something along the lines of crontab, where the syntax is checked but the configuration itself is a file in userspace.
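One possible workaround, sketched here under my own assumptions rather than anything Sun documents: keep the routes in a plain text file that lives in version control and replay it through ‘route -p’. The file format (one "destination gateway" pair per line), the apply_routes function, and the sample addresses are all hypothetical.

```shell
# Preview by default; set ROUTE=route (as root) to apply for real.
ROUTE=${ROUTE:-echo route}

# apply_routes FILE
# FILE is a hypothetical plain-text list, one "destination gateway" pair per
# line; blank lines and lines starting with '#' are skipped.
apply_routes() {
    while read -r dest gw _rest; do
        case "$dest" in
            ''|'#'*) continue ;;
        esac
        $ROUTE -p add -net "$dest" "$gw"
    done < "$1"
}

# Example: write a small sample file and preview the commands it produces.
routes_file=${TMPDIR:-/tmp}/routes.example.$$
cat > "$routes_file" <<'EOF'
# destination     gateway
10.20.0.0/16      192.168.1.254
172.16.5.0/24     192.168.1.253
EOF
apply_routes "$routes_file"
rm -f "$routes_file"
```

Since a failure to add a route that already exists is ignored, re-running the script after a deploy should be harmless. It is written for bash or ksh, not the legacy /bin/sh.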

Still, a standard method for configuring static routes is welcome in place of creating init scripts, especially with SMF services phasing out init scripts altogether.

SUNOS-8000-1L Errors caused by nxge driver for X4447A-z

I recently installed Solaris 08/07 on a T2000 with a Sun Quad GbE x8 PCIe Low Profile Adapter (X4447A-z) inside. The machine gave me lots of problems.

One of the issues was the following message which the machine logged hundreds if not thousands of times:

Oct 23 22:18:27 hostname fmd: [ID 441519 daemon.error] SUNW-MSG-ID: SUNOS-8000-1L,
TYPE: Defect, VER: 1, SEVERITY: Minor
Oct 23 22:18:27 hostname EVENT-TIME: Tue Oct 23 22:18:27 BST 2007
Oct 23 22:18:27 hostname PLATFORM: SUNW,Sun-Fire-T200, CSN: -, HOSTNAME: hostname
Oct 23 22:18:27 hostname SOURCE: eft, REV: 1.16
Oct 23 22:18:27 hostname EVENT-ID: 86cc16cc-a356-6a94-a11b-bbc8cd5e456f
Oct 23 22:18:27 hostname DESC: The EFT Diagnosis Engine encountered telemetry
for which it is unable to produce a diagnosis. Refer to
http://sun.com/msg/SUNOS-8000-1L for more information.
Oct 23 22:18:27 hostname AUTO-RESPONSE: Error reports from the component will be
logged for examination by Sun.
Oct 23 22:18:27 hostname IMPACT: Automated diagnosis and response for these
events will not occur.
Oct 23 22:18:27 hostname REC-ACTION: Run pkgchk -n SUNWfmd to ensure that
fault management software is installed properly. Contact Sun for support.

I originally assumed that these very descriptive messages were part of the same fmd service problem that I described in a previous post, but Sun found another source for the problem. Apparently it is the nxge driver.
As I write this entry, Sun is working on a new driver. They tried a test version on my server. It did not solve the problem, but it does seem to reduce the number of errors and add some information to the logs. Specifically, the entries above are sometimes preceded by a line similar to this:

nxge: [ID 752849 kern.warning] WARNING: nxge2 : nxge_ipp_err_evnts: pkt_dis_max
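To put a number on "hundreds if not thousands", something like the following rough sketch can count both kinds of entries in a log file. The function name count_fmd_noise is made up for illustration, and it assumes the messages land in a syslog file such as /var/adm/messages.

```shell
# count_fmd_noise LOGFILE
# Prints how many SUNOS-8000-1L reports and nxge_ipp_err_evnts warnings
# appear in LOGFILE.
count_fmd_noise() {
    awk '
        /SUNW-MSG-ID: SUNOS-8000-1L/ { fmd++ }
        /nxge_ipp_err_evnts/         { nxge++ }
        END { printf "SUNOS-8000-1L: %d\nnxge_ipp_err_evnts: %d\n", fmd, nxge }
    ' "$1"
}
```

It would be called as ‘count_fmd_noise /var/adm/messages’. Comparing the two counts before and after a driver change gives a crude measure of whether the test driver really reduces the errors.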

In the meantime, it seems that I will be ditching the quad cards until Sun can get their act together. I’m getting them replaced by two dual gigabit cards which use the e1000g driver.

Solaris 08/07 – fmd broken on T2000

I recently installed Solaris 08/07 on two T2000 machines and was extremely surprised to find a serious bug with the fmd (Fault Management Daemon) service.

The service would, seemingly at random, fail to start on boot. It wouldn’t actually fail, though. It just never finished starting. This caused numerous side effects, including that prtdiag, fmdump, and other fault/diagnostic utilities would not work properly. It also seemed to cause problems moving between init levels.

You may have been bitten by this bug if you see some of the following:

bash-3.00# fmadm  faulty
fmadm: failed to connect to fmd: RPC: Program not registered
bash-3.00# prtdiag -v
picl_initialize failed: Daemon not responding
bash-3.00# svcs -xv
svc:/system/fmd:default (Solaris Fault Manager)
State: offline since Mon Oct 08 15:35:25 2007
Reason: Start method is running.
See: http://sun.com/msg/SMF-8000-C4
See: man -M /usr/share/man -s 1M fmd
See: /var/svc/log/system-fmd:default.log
Impact: This service is not running.

This last output from svcs -xv can be normal, as long as it doesn’t stay that way indefinitely. The “Start method is running.” state should end and the service should go online. If it stays in this state forever, you get the idea.
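A small sketch of how to tell "still starting" apart from "stuck": poll the service state a few times and see whether it ever reaches online. The function name fmd_stuck, the retry count, and the delay are my own placeholders, and the SVCS variable exists only so the check can be tried without a real svcs binary.

```shell
# SVCS can be overridden for testing; by default it is the real svcs command.
SVCS=${SVCS:-svcs}

# fmd_stuck TRIES DELAY
# Polls the fmd service state up to TRIES times, DELAY seconds apart.
# Returns 0 (stuck) if it never reports "online", 1 once it is online.
fmd_stuck() {
    tries=$1
    delay=$2
    while [ "$tries" -gt 0 ]; do
        # -H drops the header line, -o state prints only the state column.
        state=$($SVCS -H -o state svc:/system/fmd:default 2>/dev/null | tr -d ' ')
        if [ "$state" = "online" ]; then
            return 1
        fi
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    return 0
}
```

For example, ‘if fmd_stuck 10 30; then echo "fmd is still not online"; fi’ would wait up to about five minutes before complaining.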

The next message may or may not be connected. I noticed it several times on boot in conjunction with the fmd failure to start. On the other hand, since the fmd failure caused problems with init levels, I had to sync the system from the ok prompt in order to power off the machine and this message might have been connected to the kernel panic from the previous shutdown.

ds: [ID 406019 kern.notice] NOTICE: [email protected]: invalid message length, 
received 4128 bytes, expected 37536

In the end this issue escalated its way back to Sun (after re-installing, re-installing from different media, switching disks, removing additional network cards, disabling HW RAID, re-installing again, running explorer, and realizing explorer didn’t say anything because prtdiag, etc. didn’t work).

Solution: They fixed it with an upgraded OBP firmware which was released in October.