> My last crash, using 2.0.27 happened while i was reading mail with pine.
> I got messages with: fork: try again. Any command would fail to fork,
> and i wasn't logged in as root. I waited a couple of minutes but
> watchdog (which I am assuming couldn't fork either) didn't do a cold
> reboot.
Watchdog doesn't need to fork (except maybe when you start it) to do its
job, which is simply to write something to /dev/watchdog at regular
intervals. It will continue running just fine when the process table is
full.
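
To illustrate why no fork is needed, here is a rough sketch of such a keepalive loop in Python. It is not the actual watchdog daemon's source; the names pet and keepalive, the 10-second interval and the rounds parameter are all made up for the example. The point is that after the single open, the loop only ever calls write and sleep, so a full process table doesn't stop it.

```python
import os
import time


def pet(fd: int) -> None:
    """Write one byte to the already-open device to reset its timer."""
    os.write(fd, b"\0")


def keepalive(path: str = "/dev/watchdog", interval: float = 10.0, rounds=None) -> int:
    """Open the device once, then write to it every `interval` seconds.

    rounds=None loops forever, as a daemon would; a number limits the loop
    (hypothetical parameter, only here so the sketch can be tried safely).
    Returns the number of writes performed.
    """
    fd = os.open(path, os.O_WRONLY)  # the only step that touches the /dev entry
    done = 0
    try:
        while rounds is None or done < rounds:
            pet(fd)               # no new process is created here
            done += 1
            time.sleep(interval)
    finally:
        os.close(fd)
    return done
```

Pointing it at an ordinary file, e.g. keepalive("/tmp/fake-watchdog", interval=0, rounds=3) after creating that file, lets you try it without arming the real device.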
> In the past, we've had some problems with our ncr53c810, so I think
> (though the logs dont show any trace of anything) that was the cause
> this time as well (It didn't recover e2fscking our /tmp partition, and
> I had to use a boot/root disk to e2fsck it manually before our server
> would boot again).
> If the scsi bus is totally locked up, will the kernel watchdog routines
> still be effective? I can imagine it can't check the /dev/watchdog
> device anymore, which is on a scsi disk.
The only part of /dev/watchdog that is actually on the disk is the device
numbers in the /dev directory. Once the device is opened by watchdog, it
shouldn't need the disk any more (although I think it's still marked "in
use" by reference counting).
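
If you want to see what a /dev entry actually stores, a small sketch like the following reads the device numbers back from the node. The helper name device_numbers is made up for the example; os.stat, os.major and os.minor are standard Python calls. On Linux the watchdog node is conventionally character device 10, 130, but check your own system rather than taking my word for it.

```python
import os
import stat


def device_numbers(path: str) -> tuple[int, int]:
    """Return the (major, minor) numbers stored in a character device node."""
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        raise ValueError(f"{path} is not a character device")
    # The node on disk carries only these two numbers; the driver itself
    # lives in the kernel, not on the filesystem.
    return os.major(st.st_rdev), os.minor(st.st_rdev)
```

Calling device_numbers("/dev/watchdog") only stats the node, it does not open it, so it doesn't start the timer.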
Software watchdog will only catch a small subset of crashes. A crash must
leave the kernel intact enough to execute the timer routine that checks
whether the watchdog timer has expired. That covers the case where the
watchdog user program has stopped running but has left /dev/watchdog open
(the latter condition is optional). If you need more than that, you'll have
to look at a hardware solution.
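
As a toy model of that logic (not kernel code; the class name SoftDogModel, the 60-second timeout and the disarm_on_close switch are all assumptions for the sake of the example), the timer can be thought of as a deadline that every write pushes forward. Note that nothing happens unless something still calls timer_check, which is exactly the part a badly crashed kernel can no longer do.

```python
class SoftDogModel:
    """Toy model of a software watchdog timer; times are plain seconds."""

    def __init__(self, timeout: float = 60.0, disarm_on_close: bool = True):
        self.timeout = timeout
        self.disarm_on_close = disarm_on_close  # hypothetical switch
        self.deadline = None                    # None means not armed

    def open(self, now: float) -> None:
        """Opening the device arms the timer."""
        self.deadline = now + self.timeout

    def write(self, now: float) -> None:
        """Each write from the user program pushes the deadline forward."""
        if self.deadline is not None:
            self.deadline = now + self.timeout

    def close(self, now: float) -> None:
        """Closing disarms the timer only if the model is set up that way."""
        if self.disarm_on_close:
            self.deadline = None

    def timer_check(self, now: float) -> bool:
        """What the timer routine decides: True means 'reboot now'."""
        return self.deadline is not None and now >= self.deadline
```

For example, after open(0) and write(30), timer_check(89) is False and timer_check(90) is True.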
-- "Love the dolphins," she advised him. "Write by W.A.S.T.E.."