Re: Something corrupts raid5 disks slightly during reboot

From: Ville Herva
Date: Wed Jan 14 2004 - 09:49:43 EST


On Fri, Jan 02, 2004 at 09:42:00PM +0200, you [Ville Herva] wrote:
> Summary:
>
> I've been experiencing strange corruption on a raid5 volume for some time.
> The kernel is 2.2.x + RAID-0.90 patch. Fs is ext2 (+e2compr). After
> unmounting the filesystem, I can mount it again without problems. I can also
> raidstop the raid device in between and all is still fine:
>
> > umount /dev/md4; mount /dev/md4
> - no corruption
> > umount /dev/md4; raidstop /dev/md4; raidstart /dev/md4; mount /dev/md4
> - no corruption
>
> But after a reboot, the filesystem is corrupted - a few bytes differ near the
> beginning of /dev/md4, between 1k and 5k.
>
> See the threads
> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&threadm=MMYt.4B2.1%40gated-at.bofh.it&rnum=1&prev=/groups%3Fnum%3D50%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3Dutf-8%26q%3DSomething%2Bcorrupts%2Braid5%2Bdisks%2Bslightly%2Bduring%2Breboot%26sa%3DN%26tab%3Dwg
> http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=utf-8&threadm=MZsH.72R.5%40gated-at.bofh.it&rnum=4&prev=/groups%3Fnum%3D50%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3Dutf-8%26q%3DSomething%2Bcorrupts%2Braid5%2Bdisks%2Bslightly%2Bduring%2Breboot%26sa%3DN%26tab%3Dwg
> for details.
(...)
> I found out that the difference (corruption) is usually on three bytes on
> /dev/hdg, but sometimes on /dev/hdc, too. (/dev/md4 = hdb+hdc+hdg; hdb&hdc
> are on i810, hdg is on hpt370).
>
> First, I did
> umount /dev/md4
> raidstop /dev/md4
> head -c 50k /dev/hdg > /save/hdg
> reboot
>
> To rule out kernel raid autodetect and raid code in general, I
> booted 2.2.25-1-secure with "single init=/bin/bash raid=noautodetect".
> Did
> head -c50k /dev/hdg | cmp -l /save/hdg
> Three bytes differed:
> 4641 0 35
> 4642 0 205
> 4643 0 10
> bytepos  after  before
>          boot   boot
>
> wrote the original stuff back:
> dd if=/save/hdg of=/dev/hdg
> sync
> hdparm -W0 /dev/hdg
> sync
> reboot
>
> Booted 2.2.25-1-secure with "single init=/bin/bash raid=noautodetect"
> again.
> Did
> head -c50k /dev/hdg | cmp -l /save/hdg
> The same three bytes differed again.
> Wrote the stuff back, sync'ed, did hdparm, and powered off. Still, the
> bytes differed on next boot.
>
> Then I booted 2.4.21-jam1 with "single init=/bin/bash raid=noautodetect" (I
> happened to have 2.4.21-jam1 compiled with suitable drivers at hand).
> Wrote the same stuff back with dd, synced, turned ide cache off.
> Booted 2.4.21-jam1 with "single init=/bin/bash raid=noautodetect" again.
> Did the diff; the three bytes differed again.
>
> Note that sometimes a few bytes on hdc differed, too. Usually it was just the
> three hdg bytes.
>
> So this is not a 2.2 kernel issue. I very much doubt it's a kernel issue at
> all, unless it is a bug in kernel partition detection that is still present
> in 2.4.x.
>
> I tried to turn off the ide write cache with hdparm -W0, so it shouldn't
> be a write caching issue.
>
> If it's a bios issue, it's really a strange one, since it affects both disks
> on i810 ide and on hpt370. The disks have no partition table, though, which
> _could_ confuse the bios.
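
For anyone who wants to repeat the save-and-compare steps quoted above, here
is a rough sketch of them as two shell functions. The function names and the
51200-byte count (50 * 1024, what "head -c 50k" reads) are my own choices for
illustration; the device and snapshot paths are passed in as arguments. Note
that cmp -l prints the byte number counted from 1 in decimal, followed by the
byte from its first operand and the byte from its second operand, both in
octal.

```shell
#!/bin/sh
# Sketch only: snapshot the start of a device and compare against it later.
# Function names and the byte count are illustrative assumptions.

snapshot_head() {
    # $1 = device (or file) to read, $2 = file that receives the snapshot
    head -c 51200 "$1" > "$2"
}

compare_head() {
    # $1 = device (or file) to check, $2 = snapshot taken earlier
    # First operand of cmp is the snapshot, second is the current contents
    # read from stdin. Prints nothing if the two are identical.
    head -c 51200 "$1" | cmp -l "$2" -
}
```

Usage would be "snapshot_head /dev/hdg /save/hdg" before the reboot and
"compare_head /dev/hdg /save/hdg" after it.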

Addition:

- I tried booting from 2.6.1 single user mode to 2.6.1 single user
mode (rebooting with sysrq-b to avoid the shutdown process):
-> The corruption on /dev/hdg happens like with 2.2 and 2.4

- I booted from 2.6.1 single user mode to 2.6.1 single user
mode using the kexec patch, to avoid going through the BIOS in between
-> The corruption DOES NOT happen
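
To compare runs like these, it can help to decode the cmp -l lines into
offsets within the disk. The following is a minimal sketch; the function name
is made up, and the 512-byte sector size is an assumption on my part, not
something I have verified for these drives. It relies on cmp -l counting bytes
from 1 and printing byte values in octal.

```shell
#!/bin/sh
# Sketch only: decode "cmp -l" output into offsets, sectors and hex values.
# The 512-byte default sector size is an assumption.

decode_cmp() {
    # stdin: lines of "<1-based byte number> <octal byte> <octal byte>"
    # $1 (optional): sector size in bytes
    sector_size=${1:-512}
    while read -r pos a b; do
        [ -n "$pos" ] || continue
        off=$((pos - 1))    # convert to a zero-based offset
        # A leading 0 makes printf parse the byte values as octal.
        printf 'offset=%d sector=%d in_sector=%d first=0x%02x second=0x%02x\n' \
            "$off" $((off / sector_size)) $((off % sector_size)) "0$a" "0$b"
    done
}
```

Fed the three lines from the quoted cmp output, this puts the differing bytes
at zero-based offsets 4640 to 4642, which is bytes 32 to 34 of sector 9 if
sectors are 512 bytes, with values 0x1d, 0x85 and 0x08 against 0x00.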

I'm pretty much out of ideas.


-- v --

v@xxxxxx