Re: MD-raid broken in 2.6.37.3?
From: Johan Hovold
Date: Thu Mar 10 2011 - 06:26:00 EST
On Thu, Mar 10, 2011 at 09:28:37AM +1100, NeilBrown wrote:
> On Wed, 9 Mar 2011 20:26:42 +0100 Johan Hovold <jhovold@xxxxxxxxx> wrote:
>
> > On Wed, Mar 09, 2011 at 09:02:51PM +1100, NeilBrown wrote:
> > > On Wed, 9 Mar 2011 10:06:22 +0100 Johan Hovold <jhovold@xxxxxxxxx> wrote:
[...]
> > It's the whole array that is missing. The raid-1 arrays appear but the
> > raid-0 does not.
>
> Based on that I am very confident that the problem is not related to
> any md patches in 2.6.37.3 - and your own testing below seems to confirm that.
>
> > > If you still have the boot-log from when you booted 2.6.37.3 (or can
> > > recreate it) and can get a similar log for 2.6.37.2, then it might be useful to
> > > compare them.
> >
> > Attaching two boot logs for 2.6.37.3 with /dev/md6 missing, and one for
> > 2.6.37.2.
> >
> > Note that md1, md2, and md3 have v0.90 superblocks, whereas md5 and md6 have
> > v1.20 ones and are assembled later.
> >
> > When /dev/md6 is successfully assembled, through the Gentoo init scripts
> > calling "mdadm -As", the log contains:
> >
> > messages.2:Mar 8 20:44:19 xi kernel: md: bind<sda6>
> > messages.2:Mar 8 20:44:19 xi kernel: md: bind<sda5>
> > messages.2:Mar 8 20:44:19 xi kernel: md: bind<sdb5>
> > messages.2:Mar 8 20:44:19 xi kernel: md: bind<sdb6>
>
> This doesn't look like the output that would be generated if
> "mdadm -As" were used.
> In that case you would expect to see the two '5' devices together and the
> two '6' devices together,
> e.g.:
> sda5
> sdb5
> sda6
> sdb6
>
> This looks more like the result of "mdadm -I" being called on various devices
> as udev discovers them and gives them to mdadm (it could be "mdadm
> --incremental" rather than "-I").
>
> This suggests that there is some race somewhere that is causing either a6 or
> b6 to be missed, either by udev or by mdadm - probably mdadm.
>
> I would suggest that you check if "mdadm -I" is being called by some
> udev rules.d files (/lib/udev/rules.d/*.rules or /etc/udev/rules.d/*.rules)
>
> Then maybe try to enable some udev tracing to get a log of everything it
> does. If this is something that you want to pursue, post to
> linux-raid@xxxxxxxxxxxxxxx
> with as many details as you can.
You're right about mdadm --incremental, of course.
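For anyone else chasing this: the rule that hands freshly discovered
devices to mdadm typically looks something like the sketch below (the
exact file name and match keys shipped with a given distro's mdadm may
differ):

    # e.g. /lib/udev/rules.d/64-md-raid.rules (illustrative only)
    # When blkid identifies a new block device as a raid member,
    # pass it to mdadm for incremental assembly.
    SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
        RUN+="/sbin/mdadm --incremental $env{DEVNAME}"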
Since I'm not able to reproduce it reliably (and adding udev tracing
could probably make it even harder to hit due to changed timings) and I'm
basically only cold booting on kernel updates these days, I think I'll
have to let this one be for now. At least now I know it's not hurting
my disks.
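(Should I pick this up again, the tracing would presumably amount to
something like

    # switch the running udevd to debug logging
    udevadm control --log-priority=debug

    # and/or print events with their environment as they happen
    udevadm monitor --environment

and then matching the logged sd[ab]5/6 events against the mdadm
invocations.)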
For reference, I'm using udev-151 and mdadm-3.1.4.
Thanks for your help,
Johan