Re: linux-kernel-digest V1 #118

Rob Janssen reading Linux mailinglist (linux@pe1chl.ampr.org)
Tue, 18 Jul 1995 09:18:15 +0200 (MET DST)


According to owner-linux-kernel-digest@vger.rutgers.edu:
> From: Scott Johnson <johnsos@ECE.ORST.EDU>
> Date: Sun, 16 Jul 1995 15:56:39 -0700
> Subject: Re: rmdir hangs on bad ext2 directory (1.2.11)
>
> I've been having a similar problem (rather rarely) on my system as well. I've
> got 1.2.10 running, and my entire Linux filesystem is mounted on a single
> partition (about 200 Mbytes, I've considered repartitioning to give Linux
> more, and OS/2 Warp less space, but that's another story.) The drive is a WD
> 540 "Caviar" drive, (the device is /dev/hdb5, in case that matters). Every
> once in a while, some process will try to access the /var/adm directory, and
> for some reason die (enter uninterruptible sleep). When this happens, the
> HD makes a strange noise, similar to being powered up for the first time. (My
> PC is a desktop, so the HD should not be spinning down for any reason...) It
> may be hardware trouble, it may be something Linux is doing, I dunno. At any
> rate, ANY process which tries to access this directory (/var/adm) gets put to
> sleep. syslogd is usually the first to die, but init soon follows. Any
> process which terminates afterwards, instead of dying gracefully, becomes a
> zombie. And shutting down properly with a hung init process is a pain... :)
> I end up having to give the computer the One Fingered Salute (shutdown hangs
> when trying to kill off these hung processes), and pray when I reboot and
> run fsck.

Resilience to disk errors certainly isn't Linux's strongest point...
A while ago I had some bad sectors on my SCSI disk (which does not do
automatic re-allocation of bad sectors), and it was quite difficult to
recover from that. Fortunately I had good backups.
Indeed, as you describe, any process that hits a bad sector is put into
uninterruptible sleep and cannot be killed. This causes attempts to
shut down to fail, and hence results in filesystem damage that could
otherwise have been avoided.
I think a process that suffers a disk error should get back an error code
or a signal, so that it can decide for itself how to handle the error.
Given that hardly any software bothers to check return codes, it is
probably best to use a signal (in addition to returning an error).
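
Roughly what I have in mind, seen from the application side. The EIO check
is something programs could already do today; SIGDISKERR is just a made-up
name for the proposed signal (no such signal exists), so this sketch only
installs a handler for it if it is ever defined:

/* Sketch only: SIGDISKERR is a hypothetical "disk error" signal and does
 * not exist in any current kernel; the EIO handling is what a well-behaved
 * program could already do today. */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void disk_error_handler(int sig)
{
    static const char msg[] = "disk error signalled, giving up\n";
    (void) sig;
    write(2, msg, sizeof(msg) - 1);   /* only async-safe calls in a handler */
    _exit(1);
}

int main(int argc, char **argv)
{
    char buf[4096];
    ssize_t n;
    int fd;

    if (argc < 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

#ifdef SIGDISKERR                     /* hypothetical, see above */
    signal(SIGDISKERR, disk_error_handler);
#endif

    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror("open");
        return 1;
    }

    while ((n = read(fd, buf, sizeof(buf))) != 0) {
        if (n < 0) {
            if (errno == EINTR)
                continue;
            /* A bad sector shows up here as EIO -- report it and decide
             * what to do, instead of hanging in uninterruptible sleep. */
            fprintf(stderr, "error reading %s: %s\n",
                    argv[1], strerror(errno));
            close(fd);
            return 1;
        }
        /* ... process n bytes of buf ... */
    }
    close(fd);
    return 0;
}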

The complicated path that critical errors take (via a user process,
"syslogd") does not help either. As you describe, that process can easily
get stuck, and then you get no display of errors at all.
(It is worst when running X, because you can't see the messages that are
written directly to the console screen until you exit X, which in this
situation you can't do without locking up the entire system.)
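
In the meantime a crude workaround from inside X is to dump the kernel
ring buffer yourself (essentially what dmesg does). Whether the call is
named klogctl() and lives in <sys/klog.h>, or is reached as the syslog()
system call (not the syslog(3) library routine), depends on your libc:

#include <stdio.h>
#include <sys/klog.h>

int main(void)
{
    char buf[16384];
    int n;

    /* Command 3 = read all messages currently in the kernel ring buffer
     * (non-destructive, unlike command 4 which also clears it). */
    n = klogctl(3, buf, sizeof(buf));
    if (n < 0) {
        perror("klogctl");
        return 1;
    }
    fwrite(buf, 1, n, stdout);
    return 0;
}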

It would be nice if the kernel could print such messages directly on a
fixed device, so that you could send them to the first console (instead
of the current one), to a terminal on a serial port, to a printer, etc.
That would make them less dependent on complex stuff like syslogd and X.
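
Until something like that exists in the kernel, a stopgap is a tiny
user-level relay that copies kernel messages to a fixed device of your
choice. This assumes your kernel provides /proc/kmsg, and /dev/ttyS1 is
only an example destination; run it instead of the usual kernel log
daemon, since both would be reading the same messages:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int in, out;
    ssize_t n;

    /* Reading /proc/kmsg blocks until the kernel has produced output. */
    in = open("/proc/kmsg", O_RDONLY);
    if (in < 0) {
        perror("/proc/kmsg");
        return 1;
    }

    /* The fixed destination: a serial port here, but it could just as
     * well be /dev/tty1, a printer device, or anything else. */
    out = open("/dev/ttyS1", O_WRONLY | O_NOCTTY);
    if (out < 0) {
        perror("/dev/ttyS1");
        return 1;
    }

    for (;;) {
        n = read(in, buf, sizeof(buf));
        if (n < 0) {
            if (errno == EINTR)
                continue;
            perror("read");
            return 1;
        }
        if (n > 0 && write(out, buf, n) < 0)
            perror("write");
    }
}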

Rob

-- 
+------------------------------------+--------------------------------------+
| Rob Janssen         rob@knoware.nl | AMPRnet:   rob@pe1chl.ampr.org       |
| e-mail: pe1chl@wab-tis.rabobank.nl | AX.25 BBS: PE1CHL@PI8WNO.#UTR.NLD.EU |
+------------------------------------+--------------------------------------+