Solaris 08/07 – fmd broken on T2000

I recently installed Solaris 08/07 on two T2000 machines and was extremely surprised to find a serious bug with the fmd (Fault Management Daemon) service.

The service would, seemingly at random, fail to start on boot. It wouldn’t actually fail, though; it just never finished starting. This caused numerous side effects: prtdiag, fmdump, and other fault and diagnostic utilities would not work properly. It also seemed to cause problems moving between init levels.

You may have been bitten by this bug if you see some of the following:

bash-3.00# fmadm faulty
fmadm: failed to connect to fmd: RPC: Program not registered
bash-3.00# prtdiag -v
picl_initialize failed: Daemon not responding
bash-3.00# svcs -xv
svc:/system/fmd:default (Solaris Fault Manager)
State: offline since Mon Oct 08 15:35:25 2007
Reason: Start method is running.
See: http://sun.com/msg/SMF-8000-C4
See: man -M /usr/share/man -s 1M fmd
See: /var/svc/log/system-fmd:default.log
Impact: This service is not running.

This last output from svcs -xv can be normal, as long as it doesn’t stay that way indefinitely. Normally the start method finishes and the service goes online. If the service sits in the "Start method is running." state forever, you have most likely hit this bug.
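To tell a slow start from a stuck one, you can poll the service state a few times instead of eyeballing svcs -xv. The following is only a rough sketch: the function names get_state and stuck_offline are made up for illustration, and it assumes bash plus the standard svcs -H -o state output. svcs may append an asterisk to the state of an instance that is in transition, so both "offline" and "offline*" are treated as not yet started.

```shell
#!/bin/bash
# Hypothetical helper, not part of Solaris: prints the current SMF state
# of fmd (for example "online", "offline" or "offline*").
get_state() {
    svcs -H -o state svc:/system/fmd:default 2>/dev/null
}

# Hypothetical helper: returns 0 only if fmd is offline on every poll.
# $1 = number of polls, $2 = seconds to sleep between polls.
stuck_offline() {
    local polls=$1 interval=$2 i=0 state
    while [ "$i" -lt "$polls" ]; do
        state=$(get_state)
        case "$state" in
            offline|offline\*) ;;   # still not started, keep polling
            *) return 1 ;;          # anything else: not stuck in offline
        esac
        i=$((i + 1))
        if [ "$i" -lt "$polls" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

For example, stuck_offline 10 30 checks ten times, 30 seconds apart, and returns success only if fmd never left the offline state during that window. The window length is a guess on my part; pick whatever is comfortably longer than a normal fmd start on your hardware.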

The next message may or may not be connected. I noticed it several times on boot in conjunction with the fmd failure to start. On the other hand, because the fmd failure caused problems with init levels, I had to sync the system from the ok prompt in order to power off the machine, so this message might instead have been connected to the kernel panic from the previous shutdown.

ds: [ID 406019 kern.notice] NOTICE: [email protected]: invalid message length, 
received 4128 bytes, expected 37536

In the end this issue escalated its way back to Sun. That was after re-installing, re-installing from different media, switching disks, removing additional network cards, disabling HW RAID, re-installing again, running explorer, and realizing explorer didn’t say anything because prtdiag, etc. didn’t work.

Solution: They fixed it with an upgraded OBP firmware, which was released in October.
