dying disk results in unusable system

From: dean gaudet
Date: Thu Aug 18 2005 - 19:31:57 EST

Next message: Mark Lord: "Re: [git patches] ide update"
Previous message: Roman Zippel: "Re: [RFC - 0/9] Generic timekeeping subsystem (v. B5)"
Next in thread: Alan Cox: "Re: dying disk results in unusable system"

hi...

i've run into this a bunch of times, but decided to look at it more
closely today. i use IDE disks in md raid1 and/or raid5, and when one
disk is dying or dead it tends to make the entire system unusable.

i don't really fault md here, because i'm pretty sure there are some
fundamental problems...

for example, my test setup has a bad disk and a good disk. the bad disk
is sufficiently alive that it's detected at boot time, but pretty much
every write to it results in a disk error.

the good disk houses the entire system, and is mounted noatime,nodiratime
(i don't want writes until i trigger them).

after boot i do this:

dd if=/dev/zero of=/dev/baddisk bs=1M &

wait 10 or 20 seconds, monitoring vmstat output until almost all memory is
taken by buffers. then i simulate something like queueing a mail message
(on the good disk):

perl -e 'open(F,">/var/tmp/foo"); print F "x" x 16384; fsync(F); close(F);'

this will essentially never complete. it might complete, but it'll take a
lot longer than my patience -- and regardless it's long enough to be bad
for any real system.
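
for anyone who wants to reproduce the small-write half of the test and see
how long it actually takes, here is a minimal sketch in python. it is not
the one-liner above: as far as i know core perl has no built-in fsync(), so
the sketch uses os.fsync() instead. the function name timed_fsync_write is
made up for illustration.

```python
import os
import time

def timed_fsync_write(path, size=16384):
    """Write `size` bytes of 'x' to `path`, fsync it, return elapsed seconds."""
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(b"x" * size)
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to write the file to disk
    return time.monotonic() - start
```

calling timed_fsync_write("/var/tmp/foo") on the good disk while the dd is
running should show whether the write ever returns, and after how long.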

in order to eliminate one variable i've split the bad disk and good disk
onto different controllers -- bad on a promise ultra100, and good on a
3ware 7504.

the current state of the system is like so:

# vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free    buff  cache   si   so    bi    bo   in   cs us sy id  wa
...
 0  6      0 181900 1714472  50148    0    0     0     0 1001   88  0  0  0 100
...

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       144  0.0  0.0      0     0 ?        D    16:50   0:00 [pdflush]
root       145  0.0  0.0      0     0 ?        D    16:50   0:00 [pdflush]
root      3326  0.0  0.0   6768  1044 ?        Ds   16:50   0:00 /sbin/syslogd
ntp       3429  0.0  0.2  15316  5144 ?        DLs  16:50   0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 104:104
root      3613  1.3  0.0   3976   496 pts/0    DN   17:03   0:02 dd if /dev/zero of /dev/hde bs 1M
root      3654  0.0  0.0  12460  1040 ?        D    17:05   0:00 /USR/SBIN/CRON
root      3656  0.0  0.0  10424  1564 pts/1    D+   17:05   0:00 perl -e open(F,">/var/tmp/foo"); print F "x" x 16384; fsync(F); close(F);
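
the "grep D" above matches any line containing a D, not only processes in
uninterruptible sleep. a rough sketch of a stricter check, reading the state
field from /proc/<pid>/stat, might look like this. the function names are
made up for illustration, and it assumes the usual "pid (comm) state ..."
layout of that file.

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'
    lparen = stat_line.index("(")
    rparen = stat_line.rindex(")")
    pid = int(stat_line[:lparen])
    comm = stat_line[lparen + 1:rparen]
    state = stat_line[rparen + 1:].split()[0]
    return pid, comm, state

def d_state_processes(proc="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a pid directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

running d_state_processes() every few seconds during the test would show
which processes pile up behind the stuck writes.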

my theory is that pretty much all of memory is dirty buffers which can
never be flushed... and perhaps pdflush is also stuck trying to flush
those unflushable buffers, because it won't skip ahead in the queue of
dirty buffers.
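
one way to check this theory is to watch the Dirty and Writeback fields of
/proc/meminfo while the test runs, since the vmstat "buff" column does not
say how much of that memory is dirty. a minimal sketch, with made-up
function names, assuming the usual "Name: value kB" layout of that file:

```python
def parse_meminfo(text):
    """Map field name -> integer value (kB for most fields) from meminfo text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def dirty_summary(text):
    """Report dirty and writeback memory, plus dirty as a percent of MemTotal."""
    info = parse_meminfo(text)
    dirty = info.get("Dirty", 0)
    return {
        "dirty_kb": dirty,
        "writeback_kb": info.get("Writeback", 0),
        "dirty_pct_of_total": 100.0 * dirty / info["MemTotal"],
    }
```

dirty_summary(open("/proc/meminfo").read()) could then be sampled alongside
vmstat. if the dirty figure climbs and never drops, that would be consistent
with the buffers being unflushable.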

generally at this failure point it becomes impossible to even log into the
system because of the handful of writes required... and clean reboots are
hopeless because sync will never complete.

the kernel is debian 2.6.12-1-amd64-k8-smp, but it's a problem i've
experienced many times over the years... if there are other patches or
kernels i should try, let me know.

i'm hoping this might cause some discussion... i see a few possibilities:

- if there were some way to explicitly drop dirty buffers then md raid[156]
could drop dirty buffers for the first disk to fail in a raid set -- this
would dramatically increase survivability for raid[156] setups when a
disk fails in this manner. (dropping dirty buffers in multidisk failures
might not be desirable...)

- something like blockdev(8) could be used to manually drop dirty buffers
in other write failure situations. i.e. without md i think it should
possibly be up to the admin to decide to drop dirty buffers.

- if pdflush really is stuck doing only bad writes then maybe it should
have some way to deprioritize writes to devices which have had write
failures recently.

- when a disk has exhibited a write error then be more aggressive about
blocking processes writing to the disk -- i.e. behave as if
/proc/sys/vm/dirty_ratio is a lot lower for that device. i'm skeptical
this would halt my dd process fast enough though -- because it barely
takes it any time to chew up all of memory with dirty buffers... i'm
sure the first error report comes too late.

- send dirty buffers for bad disks to swap?? this is at least a safe thing
to do even on non-raided systems... and it gets past the memory choke.
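
to make the "deprioritize writes to failing devices" idea a little more
concrete, here is a toy model in python. it is purely illustrative and has
nothing to do with how pdflush is actually implemented: the queue is just a
list of (device, block) tuples, and flush_order and recently_failed are
names invented for this sketch.

```python
def flush_order(dirty, recently_failed):
    """Return dirty buffers with healthy devices first.

    dirty: list of (device, block) tuples in queue order
    recently_failed: set of device names that have had write errors
    """
    # sorted() is stable and False sorts before True, so buffers for healthy
    # devices move to the front without being reordered among themselves
    return sorted(dirty, key=lambda buf: buf[0] in recently_failed)
```

the point of the toy is only that writes for the good disk would no longer
have to wait behind writes that are expected to fail.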

thoughts? i'm going to hang onto this bad disk so i can try out patches...
if folks point me in the right direction(s) i could even work on fixes.

-dean
