Re: Possible Bug with MD multipath and raid1 on top

From: Lars Marowsky-Bree (lmb@suse.de)
Date: Sat Sep 14 2002 - 18:07:53 EST


On 2002-09-14T20:33:07,
   Oktay Akbal <oktay.akbal@s-tec.de> said:

> I found a very strange effect when using a raid1 on top of multipathing
> with Kernel 2.4.18 (Suse-version of it) with a 2-Port qlogic HBA
> connecting two arrays.

Is this with or without the patch I recently posted to linux-kernel?

If you are using that patch, please switch to the one at
http://lars.marowsky-bree.de/dl/md-mp instead, which is slightly newer and
fixes one important issue (affecting raid0 use) and two minor ones. Please be
aware that you are beta-testing code for the time being ;-) (Which is highly
appreciated!)

> When I now pull out one of the cables two disks are missing and the
> multipath driver correctly uses the second path to the disks and
> continues to work. After unplugging the second cable all drives
> are marked as failed (mdstat), but the raid1 (md2) is still reported
> as functional with one device (md0) missing.

So far this sounds OK. (Note, though, that the updated md-mp patch will _never_
fail the last path; instead it returns the error to the layer above. This
protects against certain scenarios in 2.4 where a device error can't be
distinguished from a failed path, and we don't want that to lead to an
inaccessible device.)

> All Processes using the raid1-device get stuck and this situation
> does not recover. Even some simple process testing the disk-access
> got stuck (I think ps showed state L<D).

That's not OK, obviously ;-)

I will try to reproduce this on Monday. I don't have the hardware, so I use a
loop device instead (which I can make fail on demand). If I can't reproduce it
that way, it might in fact be the FC driver which gets stuck somehow.

> Even if I'm quite sure that this is a bug, how should I test disk access
> without ending in "uninterruptible sleep" ?

Uhm, essentially, you should never get stuck in uninterruptible sleep. All
errors should "eventually" time out.
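
Until that holds in practice, one way to keep the test itself from hanging is
to do the read in a throwaway child process and wait for it with a timeout. A
minimal sketch in Python, assuming a plain buffered read of the first 4 KiB is
a good enough probe (it may be satisfied from the cache); the name probe() and
the default timeout are made up for illustration:

```python
import subprocess
import sys

# Child program: open the path given as argv[1] and read one 4 KiB block.
_CHILD = (
    "import os, sys; "
    "fd = os.open(sys.argv[1], os.O_RDONLY); "
    "os.read(fd, 4096); "
    "os.close(fd)"
)


def probe(path, timeout=10.0):
    """Return 'ok', 'error' or 'timeout' for a test read of `path`.

    Hypothetical helper: the read happens in a separate process, so only
    that process can end up in uninterruptible sleep.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # A process in D state ignores signals until the I/O returns, so
        # the kill is only a request; we deliberately do not wait for it.
        child.kill()
        return "timeout"
    return "ok" if rc == 0 else "error"
```

Calling probe("/dev/md2", timeout=10) then returns "timeout" instead of hanging
the caller. The child may still be stuck in D state until the I/O completes,
but the test harness itself stays responsive.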

Please compile the kernel with magic-sysrq enabled and check where the
processes are stuck using magic-sysrq t. It might help if you piped the
results through ksymoops.
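
As a quick check from user space before reaching for sysrq, the processes in
uninterruptible sleep can be listed from /proc. A rough sketch in Python,
assuming the usual "pid (comm) state ..." layout of /proc/<pid>/stat; the
function names are illustrative only, and this shows which processes are
blocked, not where in the kernel they are blocked:

```python
import os


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc="/proc"):
    """Return [(pid, comm)] for all processes currently in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

The PIDs this returns are the ones to look for in the sysrq-t output.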

Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
Immortality is an adequate definition of high availability for me.
	--- Gregory F. Pfister



