Re: Nature of ext4 corruption fixed by recent patch?

From: Josh Triplett
Date: Tue May 19 2015 - 12:37:52 EST


On Tue, May 19, 2015 at 09:40:05AM -0400, Theodore Ts'o wrote:
> On Mon, May 18, 2015 at 03:58:24PM -0700, josh@xxxxxxxxxxxxxxxx wrote:
> >
> > I recently had my server's filesystem implode, and I'm currently in the
> > process of cleaning it up. It had widespread corruption in files and
> > directories scattered across the filesystem, though all vaguely recently
> > changed. Directories appeared corrupted or truncated, various files
> > showed up as piles of NULs, and 5000+ files and directories ended up in
> > lost+found. I observed this corruption shortly after a reboot into
> > 4.0.2 (from a previous kernel of 3.16), with ext4 noticing an
> > inconsistency and mounting the filesystem read-only. The underlying
> > disks had no errors.
> >
> > Reading about the corruption issue fixed by
> > d2dc317d564a46dfc683978a2e5a4f91434e9711 ("ext4: fix data corruption
> > caused by unwritten and delayed extents"), it sounds plausible. Can
> > that strike both file data and directory data, assuming all of that data
> > ended up grouped with a delayed extent? Would that bug manifest as
> > corrupted directories and files filled with NULs? The system is a
> > 72-way server on which I was doing piles of parallel git pulls and
> > builds, so hitting a race seems plausible.
>
> Unfortunately, I don't think you can blame all of your problems on the
> bug fixed by this particular patch. First of all, it doesn't apply to
> directories at all; secondly, it's been around for a long time. I'd
> have to check and see whether or not 3.16 had the problem, but it
> wouldn't surprise me at all. Finally, git pulls and builds are not
> at all likely to hit the problem.
>
> It requires the combination of (a) writing to a portion of a file that
> was not previously allocated using buffered I/O, (b) an fallocate of a
> region of the file which is a superset of the region written in (a) before
> it has a chance to be written to disk, (c) waiting for the file data in
> (a) to be written out to disk (either via fsync or via the writeback
> daemons), and then (d) before the extent status cache gets pushed out
> of memory, another random write to a portion of the file covered by
> (a) -- in which case that specific portion of (a) could be replaced by
> all zeros.
>
> Even most database workloads or torrent downloads are not likely to hit
> this pattern, since it requires an fallocate of a previously (and
> very recently) allocated region of a file using a buffered write.
> Torrent downloads will tend to fallocate the whole file in advance,
> and while Oracle or DB2 might intermix writes and fallocates, they
> don't fallocate previously written regions of the file, and they use
> direct I/O in any case.
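
For reference, steps (a) through (d) quoted above can be sketched as a
short script. This is only an illustration of the syscall pattern, not a
reproducer taken from this thread: the block size, offsets, and function
names are assumptions, Python's os.posix_fallocate stands in for the
fallocate call, and nothing here controls the writeback timing or the
extent status cache that the race depends on. On a kernel without the
bug, the data read back should match what was written.

```python
import os
import tempfile

BLOCK = 4096  # assumed filesystem block size (illustrative)


def run_sequence(path):
    """Run steps (a)-(d) on `path`; return the region from (a) as read back."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # (a) buffered write to a region of the file not previously allocated
        os.pwrite(fd, b"A" * BLOCK, 4 * BLOCK)
        # (b) allocate a superset of that region before it reaches disk
        os.posix_fallocate(fd, 0, 16 * BLOCK)
        # (c) wait for the data from (a) to be written out
        os.fsync(fd)
        # (d) another write to a portion of the region covered by (a)
        os.pwrite(fd, b"B" * 512, 4 * BLOCK + 1024)
        os.fsync(fd)
        return os.pread(fd, BLOCK, 4 * BLOCK)
    finally:
        os.close(fd)


def expected():
    """Contents the region from (a) should have when nothing is lost."""
    data = bytearray(b"A" * BLOCK)
    data[1024:1024 + 512] = b"B" * 512
    return bytes(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        got = run_sequence(os.path.join(tmp, "testfile"))
        print("intact" if got == expected() else "region differs from expected")
```

A single run like this is very unlikely to trigger anything even on an
affected kernel; it only shows which calls have to line up.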

Ah, thanks for the clarification. :(

In particular, I didn't realize this was *only* the data of the
delayed-extent-based files. The bug here seems to have struck various
recently-written files and directories. (Recent in days, not seconds,
as far as I can tell; and it isn't universal based on age.) The initial
symptom was ext4 noticing that a directory was corrupt (truncated, IIRC)
and immediately marking the whole filesystem read-only.

> So it's pretty hard to hit this bug by accident, unless you happen to
> be using fsx, and even then, the only files that would get corrupted
> would be the files being written using fsx. So I'm afraid you'll have
> to look farther afield, and consider other bugs as well as potential
> hardware problems before trusting the system again.

I'm quite skeptical of hardware problems. The system is a few months
old, well past infant-mortality and too young for burnout. And I've
tested the disks carefully.

Are there any other known bugs that seem likely to fit the symptoms and
circumstances?

Note that since I saw this after rebooting from 3.16 into 4.0.2, I don't
know whether the corruption was more likely caused by 3.16 or 4.0.2.

> P.S. Bugs like these are why I'm always amused by people
> who think that just because a file system is safely being used by
> its developers, it's safe to throw production workloads on
> it.

Heh. Yeah, I like exciting new software in most areas, but not in
filesystems. In filesystems I prefer boring. :)

> These sorts of subtle data corruptors tend to be highly timing
> dependent, and very hard to find. Sometimes these bugs can hang around
> for years before they are found and fixed. The flip side is that
> fortunately, they tend to strike very rarely.

...lucky me.

> It's also why I'm very
> grateful for developers like Jan and Lukas. :-)

Indeed.

- Josh Triplett