[PATCH-2.4] raid5 oopses on 2nd failed device

From: Alex Tomas (alexey@technomagesinc.com)
Date: Tue Apr 15 2003 - 05:42:29 EST


hi!

It looks like raid5 is buggy: when a second device fails, I get an oops.
It is the BUG at raid5.c:212, triggered because sh->bh_written[N] isn't NULL.
The raid5d thread tries to handle this stripe, but the following condition
skips the handling:

        /* might be able to return some write requests if the parity block
         * is safe, or on a failed drive
         */
        bh = sh->bh_cache[sh->pd_idx];
        if ( written &&
             ( (conf->disks[sh->pd_idx].operational && !buffer_locked(bh) && buffer_uptodate(bh))
               || (failed == 1 && failed_num == sh->pd_idx))
            ) {

I suggest failing the requests that can't be returned as successful.
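
To illustrate why the quoted condition is skipped, here is a minimal user-space sketch of the same predicate. It is not kernel code: the function name can_return_writes and its flattened boolean parameters are hypothetical stand-ins for the conf/sh/bh state. It assumes the case where the parity device is one of the two failed devices.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the quoted test; parameter names are assumptions. */
static bool can_return_writes(int written, bool parity_operational,
                              bool parity_locked, bool parity_uptodate,
                              int failed, int failed_num, int pd_idx)
{
    assert(written >= 0 && failed >= 0);  /* counters are never negative */

    return written &&
           ((parity_operational && !parity_locked && parity_uptodate) ||
            (failed == 1 && failed_num == pd_idx));
}
```

With failed == 2 and the parity device not operational, neither branch can be true, so the buffers queued on bh_written are never returned.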

With best regards

diff -puNr linux/drivers/md/raid5.c edited/drivers/md/raid5.c
--- linux/drivers/md/raid5.c Wed Jan 15 20:18:37 2003
+++ edited/drivers/md/raid5.c Tue Apr 15 12:59:24 2003
@@ -938,7 +938,19 @@ static void handle_stripe(struct stripe_
                     }
                 }
         }
-
+
+        /* if already written requests can't be returned as successful fail them */
+        if (failed > 1 && written) {
+                for (i=disks; i--; ) {
+                        if (sh->bh_written[i]) written--;
+                        while ((bh = sh->bh_written[i])) {
+                                sh->bh_written[i] = bh->b_reqnext;
+                                bh->b_reqnext = return_fail;
+                                return_fail = bh;
+                        }
+                }
+        }
+
         /* Now we might consider reading some blocks, either to check/generate
          * parity, or to satisfy requests
          */
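
The loop added by the patch pops each buffer off a per-device written list and pushes it onto the failure list. A minimal user-space sketch of that list manipulation follows. The names struct buf, NDISKS and fail_written are hypothetical; struct buf only stands in for struct buffer_head, with reqnext mirroring b_reqnext.

```c
#include <assert.h>
#include <stddef.h>

#define NDISKS 4                 /* hypothetical number of devices */

struct buf {                     /* stand-in for struct buffer_head */
    struct buf *reqnext;         /* mirrors b_reqnext */
};

/*
 * Move every buffer queued on written_q[] onto the list headed by
 * *return_fail. 'written' counts devices with a non-empty queue and is
 * decremented once per drained device; the new count is returned.
 */
static int fail_written(struct buf *written_q[NDISKS], int written,
                        struct buf **return_fail)
{
    assert(return_fail != NULL);

    for (int i = NDISKS; i--; ) {
        struct buf *bh;

        if (written_q[i])
            written--;
        while ((bh = written_q[i]) != NULL) {
            written_q[i] = bh->reqnext;   /* unlink from the written queue */
            bh->reqnext = *return_fail;   /* push onto the failure list */
            *return_fail = bh;
        }
    }
    return written;
}
```

Each buffer is pushed at the head of the failure list, so buffers from one device end up in reverse order; that should not matter if the list is only walked to complete each request with an error.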
