On the subject of the VFS layer (was Re: VFS questions)

J. Sean Connell (ankh@canuck.gen.nz)
Sat, 3 May 1997 19:38:32 +1200 (NZST)

I posted this quite a few months ago now, and was faced with a deafening
wall of silence in return; at least, any replies never made their way back
to me.

Back before we got the brand-new P166 (Linux-based, of course =) news
server on its feet (it's amazing how many problems bad RAM can cause!), we
had a SCSI disk that had a few bad blocks on it.

We got scsiinfo and enabled bad block remapping, and eventually dumped the
disk onto tape and did a controller-level format on the drive, all to no
avail. We subsequently replaced the drive, and <sarcasm>threw the old one
down the lift shaft</sarcasm>.

You see, the box would merrily be scanning through a directory looking for
a file, and it would hit an inode that was on a bad block. Being unable to
read the block, it'd print a nice little panic on the screen, and that
would be it for being able to write to that fs until the box was
rebooted by hitting the reset switch (init would try to sync(), which
would hit the dirty fs, which wouldn't be sync()able because it found a
bad block).

I took a look at the kernel sources (both ext2fs and eventually the VFS
layer), and I was amazed by what I (didn't) find: the low-level disk
manipulation functions (e.g., the ext2fs routine for reading an inode off
the disk), while they themselves check for errors returned by, e.g.,
bread(), have no way to actually *report* this information to higher
layers. And of course, if the bread() fails, the inode retrieval has
failed... but there's no way to tell the caller this fact. At least, not
that I could find.

This, to me, reduces robustness: it doesn't do me very much good if Linux
chokes on one bad block even with the errors=continue mount option.
Solaris and SunOS (I know, not necessarily the /best/ examples) simply
bitch about it on kernel.warn and keep going; we had the same disk in an
SS2+ running SunOS, and I never once found any processes stuck in the D
state, even after seeing hundreds of "Bad block on /dev/sdasomething"
messages in the logs. Heck, sometimes the box would be up another few
weeks without any problems. I don't actually know what they report to a
user-level process when it tries to do an open() and the fs code finds a
bad block on the way there.

Fixing this would require a total overhaul of every filesystem and the
VFS layer, but even so, I don't see how I could cheerfully use Linux in a
zero-fault-tolerance environment when the fs code simply disowns the
whole fs the moment it finds a single measly bad block...

What'd be really cool would be the ability to umount the filesystem after
an ext2 panic to do an e2fsck -c, but since the fs is dirty and can't be
sync()ed (any process which tries to sync a fs that's been panicked about
gets stuck in the D state, waiting for sync() to return -- which it never
will), you can't. Which means I get to wait forever for the 20-odd GB
of disk in the news server to fsck.

Fortunately, we don't have any bad disks in there /now/, but it'd still be
nice to see.

Just my two cents.

J. S. Connell      | Systems Administrator, ICONZ.  Any opinions stated above
ankh@canuck.gen.nz | are not my employers', not my boyfriends', my God's, my
ankh@iconz.co.nz   | friends', and probably not even my own.
            PGP key at http://www.canuck.gen.nz/~ankh/pgpkey.html