Re: mkfs.ext2 triggered RAM corruption

From: Bernd Schubert
Date: Sat May 05 2007 - 19:09:27 EST


On Sat, May 05, 2007 at 02:57:35PM -0400, Theodore Tso wrote:
> On Sat, May 05, 2007 at 03:36:37AM +0200, Bernd Schubert wrote:
> > distribution: modified debian sarge, in which aspect is the distribution
> > important for this problem? mkfs.ext2 is supposed to write to /dev/sdaX
> > and not /dev/rd/0. Stracing it and grepping for open calls shows that
> > only /dev/sdaX is opened in read-write mode.
>
> /dev/rd/0? What's this? Is this the partition where your root
> partition is found? What is it? Is it a ramdisk? Or is it some kind
> of persistent storage device?
>
> If it is a persistent storage device, do the corrupted files stay
> corrupted when you reboot? (If it's a ramdisk which you load, then
> obviously it's getting reloaded on reboot.) You didn't give enough
> information to be sure exactly what's going on.

Sorry, I should have expressed myself more clearly: /dev/rd/0 is the
devfs-style name of the first RAM disk device (I don't like those devfs
names myself, but since I'm rather new in this group I couldn't convince
my boss to switch to short names yet ;) ). However, it's only the
devfs-style naming provided by udev, not devfs itself.

>
> The next thing to ask is how the files are corrupted. Can you
> save a copy of the corrupted files to stable storage, so you can see
> *how* they were corrupted? Were large swaths of zeros getting written
> into them?

Yes, many zeros. Binary files, hexdump and diff are here:
http://www.q-leap.com/~bschubert/data-corruption
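
For anyone who wants to look for such zeroed regions in their own copies of
damaged files, here is a minimal sketch. It is not the tool used to produce
the hexdumps above; the function names, the 512-byte threshold and the
4096-byte page size are assumptions chosen for illustration only.

```python
def zero_runs(data: bytes, min_len: int = 512):
    """Return (offset, length) for every run of zero bytes >= min_len."""
    runs = []
    start = None  # offset where the current zero run began, if any
    for i, byte in enumerate(data):
        if byte == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # A run may extend to the very end of the data.
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs


def page_aligned(runs, page: int = 4096):
    """Keep only runs that start on a page boundary and span whole pages."""
    return [(off, n) for off, n in runs if off % page == 0 and n % page == 0]


# Synthetic example: 4096 bytes of data, two zeroed pages, then more data.
sample = b"\x7fELF" + b"A" * 4092 + bytes(8192) + b"B" * 100
```

Runs that begin on a page boundary and cover whole pages would suggest that
entire pages were overwritten, rather than scattered bytes.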

>
> Next question; if you don't use these mke2fs parameters, can you
> reproduce the corruption?
>
> mkfs.ext2 -j -b 4096 -F -i 4096 -J size=400 -I 512 /dev/sda4
>
> What if you change it to:
>
> mkfs.ext2 -j -b 4096 /dev/sda4
>
> Do you still see corruption problems?

No, no observable corruption.
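
To put the parameters of the first command in perspective: -i sets the
bytes-per-inode ratio, -I the on-disk inode size, and -J size= the journal
size in megabytes. The rough estimate below is only a sketch. The size of
sda4 is not given here, so it borrows the 313251435-sector sda3 from the
sfdisk dump quoted further down as an example, assumes 512-byte sectors, and
ignores the per-block-group rounding that mke2fs really applies. The function
names are made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: sfdisk "unit: sectors" means 512-byte sectors


def estimate_inodes(partition_bytes: int, bytes_per_inode: int) -> int:
    """Approximate inode count for a given -i value (no group rounding)."""
    return partition_bytes // bytes_per_inode


def inode_table_bytes(partition_bytes: int, bytes_per_inode: int,
                      inode_size: int) -> int:
    """Approximate space taken by inode tables for given -i and -I values."""
    return estimate_inodes(partition_bytes, bytes_per_inode) * inode_size


def inode_table_fraction(bytes_per_inode: int, inode_size: int) -> float:
    """Share of the partition occupied by inode tables."""
    return inode_size / bytes_per_inode


# Example size: sda3 from the sfdisk dump, used here as a stand-in.
example_bytes = 313251435 * SECTOR_BYTES
table_gib = inode_table_bytes(example_bytes, 4096, 512) / 2**30
print(f"inode tables: about {table_gib:.1f} GiB, "
      f"{inode_table_fraction(4096, 512):.1%} of the partition")
```

With -i 4096 and -I 512 the inode tables take one eighth of the partition, on
top of the 400 MB journal, so the first command asks for far more metadata
than a typical invocation would.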

>
> > I already tested several partition types, e.g. something like this for a
> > test on sda3
> >
> > beo-05:~# sfdisk -d /dev/sda
> > # partition table of /dev/sda
> > unit: sectors
> >
> > /dev/sda1 : start= 63, size= 4208967, Id=83
> > /dev/sda2 : start= 4209030, size= 4209030, Id=83
> > /dev/sda3 : start= 8418060, size=313251435, Id=83
> > /dev/sda4 : start= 0, size= 0, Id= 0
>
> What if the partition size is smaller; does that make the problem go
> away? If so, can you do a binary search on the partition size where
> the problem appears?

I need to test this thoroughly and will do it tomorrow; it's too late
here for this kind of test.
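
The binary search suggested above could be organised roughly as follows. This
is only a sketch: `corrupts` is a hypothetical callback standing for one
manual test cycle (repartition to the given size, run the mkfs command, check
the files on the RAM disk), and the search assumes that once a size triggers
the corruption, every larger size does too.

```python
def first_failing_size(good: int, bad: int, corrupts) -> int:
    """Return the smallest size for which corrupts(size) is true.

    `good` must be a size known not to trigger the problem and `bad` a
    size known to trigger it. Assumes the outcome is monotonic in size.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if corrupts(mid):
            bad = mid   # problem still appears: threshold is at or below mid
        else:
            good = mid  # no problem: threshold is above mid
    return bad


# Dry run with a made-up threshold, counting how many test cycles are needed.
calls = []


def fake_corrupts(size: int) -> bool:
    calls.append(size)
    return size >= 123456789  # hypothetical threshold, in sectors


threshold = first_failing_size(0, 313251435, fake_corrupts)
```

Each halving needs one reboot-and-test cycle, so a partition of this size is
narrowed down to a single sector in fewer than 30 cycles; stopping earlier at
a coarser granularity would normally be enough.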

>
> And what can you say about the SATA driver you were using; were all of
> the machines that you tested this on using the same SATA controller
> and same driver?

As you can see from my previous reply ;) I tested with at least two
different controllers, Intel and NVIDIA (I will reboot the 4th system on
Monday to figure out its hardware; once the corruption has happened, the
systems tend to stop working).

>
> Obviously if this were a generic kernel problem, we'd be hearing
> about this from a lot more people. So there has to be something
> unique to your setup, and we need to figure out what that might happen
> to be.

I also still find it hard to believe it's a generic problem...


Thanks for your help,
Bernd
