Root RAID1 & SCSI crash sanity (includes kernel oops)

Christian Reis (kiko@radiumsystems.com.br)
Thu, 14 Oct 1999 21:06:54 -0200 (EDT)


I have a proto-production root RAID1 array set up, and it's mostly working
fine. I had quite a bit of trouble getting the root md device set up and
booting, but now that it's done, it's stable.

I usually run a set of tests to see how well my RAID1 is holding up -
'hot' pulling the drives' data cables - perhaps not the smartest approach,
but the closest I can get to an actual failure. On the IDE systems I've
used, the chipset usually complains a lot, but the system goes on being
quite usable - even without a raidhotremove.
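
For the record, the check I run after pulling a cable is nothing fancy.
Something along these lines (a rough sketch - the md device and partition
names are only examples, not my real layout):

  # A member md has given up on shows up in /proc/mdstat with "(F)"
  # after its name, e.g. "sdb1[1](F)":
  grep '(F)' /proc/mdstat

  # Kick the dead partition out of the array by hand:
  raidhotremove /dev/md0 /dev/sdb1

  # Once the drive is back, re-add it and let the resync run:
  raidhotadd /dev/md0 /dev/sdb1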

On this Adaptec 2940 box, however, the SCSI subsystem complains awfully
when I hot-disconnect a drive. The system all but slows to a crawl, which
makes me wonder what a real disk failure will do to it. If I raidhotremove
the failed partitions the system goes back to normal usability, but then I
get a nice kernel oops after a while.

(Can't paste the oops in but it's coming shortly)

Why doesn't the raid array - in the presence of a failed disk - remove the
disk altogether? I actually _thought_ it did, but the SCSI subsystem only
stopped complaining once I had raidhotremoved the partitions.

After a reboot everything's fine, of course. Does this mean we need
somebody to come over and reboot the machine whenever a drive breaks? The
'works partially' aspect rules out a simple watchdog, IMO - the box never
actually hangs, so nothing would ever trigger it.
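
To be concrete about what I'd want instead of a babysitter: a dumb monitor
along the lines of the loop below, which watches /proc/mdstat and
hot-removes whatever gets flagged as failed. This is only a sketch - it
assumes a single RAID1 array at /dev/md0 and example device names - and
with the raidhotremove path oopsing, running it would only automate the
crash.

  #!/bin/sh
  # Sketch only: poll mdstat and hot-remove members marked failed.
  # Assumes one array, /dev/md0; adjust for anything more elaborate.
  while true; do
      # failed members appear as e.g. "sdb1[1](F)"; strip down to "sdb1"
      for part in `grep '(F)' /proc/mdstat | tr ' ' '\n' | grep '(F)' | sed 's/\[.*//'`
      do
          logger "md: $part marked failed, hot-removing it from md0"
          raidhotremove /dev/md0 /dev/$part
      done
      sleep 10
  done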

I get the feeling the raid code is on the verge of stability, and that I'm
not supposed to push it too far. When is this feeling going to wear off? I
suspect when we get RAID into the mainline kernel.

k
