Re: [BUG] raid5 crash with 2.4.0-test12 [Was: Linux-2.4.0-test12]

From: Linus Torvalds (torvalds@transmeta.com)
Date: Tue Dec 12 2000 - 22:08:09 EST


On Wed, 13 Dec 2000, Neil Brown wrote:
>
> Yes... you are right. Alright, I can't escape it any other way so I
> guess I must admit that it is a raid5 bug.
>
> But how can raid5 be calling b_end_io on a buffer_head that was never
> passed to generic_make_request?
> Answer, it snoops on the buffer cache to try to do complete stripe
> writes.

Ahh, yes. It seems to just do a "get_hash_table()", and put that bh into
the queues. Bad.

> The following patch disabled that code.

If this fix makes the oops go away, then the proper fix for the problem is
not the #if 0, but to add something like

        bh->b_end_io = buffer_end_io_sync;

to just before the "add_stripe_bh(sh, bh, i, WRITE);"

We've already locked the thing, so that should be ok.
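The suggested fix can be sketched as a small userspace C model. This is not the kernel code. `struct model_bh` is a hypothetical stand-in for `struct buffer_head` that keeps only a lock flag, an uptodate flag and a `b_end_io` pointer. `model_end_io_sync`, `model_complete_stripe_bh` and `model_add_snooped_bh` are made-up names. A NULL `b_end_io` stands for a buffer snooped via `get_hash_table()` that has no handler suitable for the stripe completion path. The real buffer might instead carry a handler meant for some other kind of I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct buffer_head:
 * only the fields this sketch needs. Not the kernel definition. */
struct model_bh {
    bool locked;    /* models the BH_Lock bit */
    bool uptodate;  /* models the BH_Uptodate bit */
    void (*b_end_io)(struct model_bh *bh, int uptodate);
};

/* Models a synchronous-write completion handler: record the result
 * and drop the lock that was taken before the buffer was queued. */
static void model_end_io_sync(struct model_bh *bh, int uptodate)
{
    assert(bh->locked); /* completion only makes sense on a locked buffer */
    bh->uptodate = (uptodate != 0);
    bh->locked = false;
}

/* Models the stripe completion path: it calls whatever b_end_io the
 * buffer carries. Returns false where the kernel would call through
 * a NULL (or wrong) pointer and oops. */
static bool model_complete_stripe_bh(struct model_bh *bh, int uptodate)
{
    if (bh->b_end_io == NULL)
        return false;
    bh->b_end_io(bh, uptodate);
    return true;
}

/* The suggested fix: the snooped buffer is already locked, so install
 * a completion handler right before it is queued on the stripe
 * (where add_stripe_bh(sh, bh, i, WRITE) would follow). */
static void model_add_snooped_bh(struct model_bh *bh)
{
    assert(bh->locked);
    bh->b_end_io = model_end_io_sync;
}
```

In this model, completing a snooped buffer before a handler is installed fails. After `model_add_snooped_bh()`, the same completion drops the lock and marks the buffer uptodate. The handler is set only once the buffer is locked. That ordering reflects the point that the lock has already been taken, so nobody else should be touching `b_end_io` at that moment.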

I wonder about that "md_test_and_set_bit(BH_Lock ...);" thing there,
though. If the buffer we find was dirty but already locked, we won't be
using that buffer at all (because the md_test_and_set_bit() will fail),
which probably means that the RAID5 checksum won't be right. Hmm..
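The checksum worry can be illustrated with a toy XOR-parity model. This is a hypothetical sketch, not the raid5 code. `struct model_block` tracks a block's on-disk and in-memory contents plus dirty and lock bits. `model_test_and_set_lock()` plays the role of `md_test_and_set_bit(BH_Lock, ...)`. A dirty block that is already locked gets skipped, so parity is computed from its old on-disk contents. The sketch also assumes that whoever holds the lock later writes the newer data to disk outside raid5. That assumption is this model's own; the thread does not establish it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one cached block in a stripe. */
struct model_block {
    uint8_t disk;  /* contents currently on disk */
    uint8_t mem;   /* contents in the (possibly dirty) buffer */
    bool dirty;    /* models BH_Dirty */
    bool locked;   /* models BH_Lock */
};

/* Models md_test_and_set_bit(BH_Lock, ...): set the bit, return old value. */
static bool model_test_and_set_lock(struct model_block *b)
{
    bool old = b->locked;
    b->locked = true;
    return old;
}

/* A dirty block is written with the stripe only if we win the lock;
 * a dirty block that is already locked is skipped. Parity always
 * covers whatever is on disk for the block afterwards. */
static uint8_t model_stripe_parity(struct model_block blk[], int n)
{
    uint8_t parity = 0;
    for (int i = 0; i < n; i++) {
        if (blk[i].dirty && !model_test_and_set_lock(&blk[i])) {
            blk[i].disk = blk[i].mem; /* written as part of the stripe */
            blk[i].dirty = false;
            blk[i].locked = false;    /* write completes, lock dropped */
        }
        parity ^= blk[i].disk;
    }
    return parity;
}

/* The lock holder finishes its own write, bypassing the parity update. */
static void model_foreign_write_completes(struct model_block *b)
{
    assert(b->locked && b->dirty);
    b->disk = b->mem;
    b->dirty = false;
    b->locked = false;
}

/* True if the XOR of the on-disk data still matches the stored parity. */
static bool model_parity_consistent(const struct model_block blk[], int n,
                                    uint8_t parity)
{
    uint8_t x = 0;
    for (int i = 0; i < n; i++)
        x ^= blk[i].disk;
    return x == parity;
}
```

For example, take a stripe with one clean block, one dirty unlocked block and one dirty locked block. The parity returned by `model_stripe_parity()` is consistent at first. It goes stale as soon as `model_foreign_write_completes()` runs on the skipped block.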

Why is there a dirty aliased buffer head anyway? That sounds like a
recipe for disaster - maybe we should have synched all the stripe devices
before we set up the raid? Is that a raid5 rebuild issue? What's going on
here?

                Linus




This archive was generated by hypermail 2b29 : Fri Dec 15 2000 - 21:00:26 EST