Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

From: Nix
Date: Fri Oct 26 2012 - 08:12:04 EST


On 26 Oct 2012, Theodore Ts'o spake thusly:

> On Thu, Oct 25, 2012 at 08:11:12PM -0400, Ric Wheeler wrote:
>>
>> Sending this just to you two to avoid embarrassing myself if I
>> misread the thread, but....
>>
>> Can we reproduce this with any other hardware RAID card? Or with MD?
>
> There was another user who reported very similar corruption using
> 3.6.2 using USB thumb drive. I can't be certain that it's the same
> bug that's being triggered, but the symptoms were identical.

I now suspect it's the same bug, triggered in a different way, but also
by a block-layer problem. In my case the block device driver does not
block while the umount finishes (or throws away some of the data the
umount writes; which of the two is not yet known). In the thumb-drive
case the block device goes away because someone pulled it out of the USB
socket. Either way, it appears that interrupting an ext4 umount while
data is being written does bad, bad things to the filesystem.
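
To make that hypothesis concrete, here is a toy model in Python. It is
only a sketch: it is not ext4, not the real block layer, and not a
reproducer. Every name in it (ToyDevice, umount_flush, is_consistent) is
made up for illustration, and "all or nothing" is a deliberately crude
stand-in for real filesystem consistency. It only shows that a device
which stops accepting writes partway through the final writeout leaves a
mix of old and new blocks behind, whether the cause is a driver that
drops writes or a drive that was unplugged.

```python
# Toy model only: this is NOT ext4 or the real block layer.
# All names here are hypothetical and exist purely for illustration.

class ToyDevice:
    """A block device that silently drops writes after `cutoff` writes."""

    def __init__(self, nblocks, cutoff=None):
        self.blocks = ["old"] * nblocks
        self.cutoff = cutoff      # None means the device never goes away
        self.accepted = 0

    def write(self, blockno, data):
        if self.cutoff is not None and self.accepted >= self.cutoff:
            return                # device gone, or driver threw the write away
        self.blocks[blockno] = data
        self.accepted += 1


def umount_flush(dev, dirty):
    """Write out every dirty block, in order, as a umount would have to."""
    for blockno, data in dirty:
        dev.write(blockno, data)


def is_consistent(dev, dirty):
    """Crude stand-in: either all of the dirty blocks landed, or none did."""
    landed = [dev.blocks[blockno] == data for blockno, data in dirty]
    return all(landed) or not any(landed)


dirty = [(0, "new"), (1, "new"), (2, "new"), (3, "new")]

good = ToyDevice(4)               # writeout runs to completion
umount_flush(good, dirty)

cut = ToyDevice(4, cutoff=2)      # writeout is cut off after two blocks
umount_flush(cut, dirty)

print(is_consistent(good, dirty), good.blocks)
print(is_consistent(cut, dirty), cut.blocks)
```

In this model the uninterrupted device ends up with all four blocks
updated, while the interrupted one keeps two new blocks and two stale
ones, which is the kind of half-written state described above.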

>> If we cannot reproduce this in other machines, why assume this is an
>> ext4 issue and not a hardware firmware bug?

A tad unlikely. Why would a firmware bug show up only at the instant of
reboot? Why would it show up as a lack of blocking on the kernel side? I
assure you that if you write lots of data to this controller normally,
you will end up blocking :) I can completely believe that it's an arcmsr
driver bug, though. If it were an ext4 bug, it would surely be
reproducible in virtualization, or on different hardware, or something
like that.

--
NULL && (void)