Re: [PATCH 0/4] (RESEND) ext3 barrier changes
From: Chris Mason
Date: Sat May 17 2008 - 20:59:22 EST
On Friday 16 May 2008, Andrew Morton wrote:
> On Fri, 16 May 2008 20:20:30 -0400 Theodore Tso <tytso@xxxxxxx> wrote:
> > On Fri, May 16, 2008 at 11:53:04PM +0100, Jamie Lokier wrote:
> > > > > If you just want to test the block I/O layer and drive itself,
> > > > > don't use the filesystem, but write a program which just access the
> > > > > block device, continuously writing with/without barriers every so
> > > > > often, and after power cycle read back to see what was and wasn't
> > > > > written.
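A minimal sketch of the kind of raw-device tester Jamie is describing (Python purely for illustration; an ordinary file stands in for the block device, fsync() stands in for the barrier, and the record size and layout are assumptions of mine, not anything from this thread):

```python
import os
import struct
import zlib

RECORD = 512  # one checksummed record per sector-sized slot (assumption)

def write_records(path, count, flush_every=8):
    """Write sequentially numbered, checksummed records.  fsync() is the
    barrier stand-in: everything written before it should be durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for seq in range(count):
            payload = struct.pack("<Q", seq).ljust(RECORD - 4, b"\0")
            rec = payload + struct.pack("<I", zlib.crc32(payload))
            os.pwrite(fd, rec, seq * RECORD)
            if (seq + 1) % flush_every == 0:
                os.fsync(fd)  # "barrier": all prior records must be on media
    finally:
        os.close(fd)

def verify_records(path):
    """Return the sequence numbers whose checksums verify, so after a
    power cycle you can see exactly what did and didn't reach the disk."""
    good = []
    with open(path, "rb") as f:
        data = f.read()
    for off in range(0, len(data) - RECORD + 1, RECORD):
        payload = data[off:off + RECORD - 4]
        crc = struct.unpack("<I", data[off + RECORD - 4:off + RECORD])[0]
        if crc == zlib.crc32(payload):
            good.append(struct.unpack("<Q", payload[:8])[0])
    return good
```

After a real power cycle you would run verify_records() against the device and compare the surviving set against what had been flushed before the cut.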
> > > >
> > > > Well, I think it is worth testing through the filesystem, different
> > > > journaling mechanisms will probably react^wcorrupt in different ways.
> > >
> > > I agree, but intentional tests on the block device will show the
> > > drive's characteristics on power failure much sooner and more
> > > consistently. Then you can concentrate on the worst drivers :-)
> > I suspect the real reason why we get away with it so much with ext3 is
> > that the journal is usually contiguous on disk, hence, when you write
> > to the journal, it's highly unlikely that the commit block will be written
> > and the blocks before the commit block have not.
> yup. Plus with a commit only happening once per few seconds, the time
> window for a corrupting power outage is really really small, in
> relative terms. All these improbabilities multiply.
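Back-of-the-envelope, the multiplication Andrew is pointing at looks like this (every number below is made up for illustration, not a measurement):

```python
# Purely illustrative figures: how the improbabilities multiply into a
# small per-power-cut corruption chance.
commit_interval_s = 5.0      # ext3 commits roughly every 5 seconds (default)
vulnerable_window_s = 0.01   # assumed time a commit is reorderably in flight
p_cut_in_window = vulnerable_window_s / commit_interval_s
p_drive_reorders = 0.1       # assumed chance the drive actually reorders
p_corrupt_per_cut = p_cut_in_window * p_drive_reorders
print(p_corrupt_per_cut)     # orders of magnitude below either factor alone
```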
Well, the barriers happen like so (even if we actually only do one barrier in
submit_bh, it turns into two):
write log blocks
flush #1
write commit block
flush #2
write metadata blocks
I'd agree with Ted: there's a fairly small chance of things getting reordered
around flush #1. Flush #2 is likely to see lots of reordering, though. It
should be easy to create situations where the metadata for a transaction is
written before the log blocks ever see the disk.
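To make the flush #2 point concrete, here's a toy model (my own construction, nothing to do with the actual kernel code) of a drive that makes everything before a completed barrier durable, but may persist any subset of the writes queued after the last barrier when power is cut:

```python
import random

def power_cut(writes, barrier_after, rng):
    """Toy drive model.  `writes` is the ordered list of block names,
    `barrier_after` the set of indices after which a flush/barrier
    completes.  A completed barrier makes every earlier write durable;
    of the writes queued after the last barrier, the drive may have
    persisted any subset, in any order, at the moment of power loss."""
    durable, pending = [], []
    for i, name in enumerate(writes):
        pending.append(name)
        if i in barrier_after:
            durable += pending
            pending = []
    survived = {name for name in pending if rng.random() < 0.5}
    return durable, survived

seq = ["log0", "log1", "commit", "meta"]
# With flush #1 after the log blocks (index 1) and flush #2 after the
# commit block (index 2), the metadata can only reach disk once the
# commit block is already durable.  Drop flush #2 (barrier_after={1})
# and commit+meta ride in the same window, so the drive can land the
# metadata while losing the commit block -- exactly the corruption case.
```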
EMC did a ton of automated testing around this when Jens and I did the initial
barrier implementations, and they were able to trigger corruptions in fsync
heavy workloads with randomized power offs. I'll dig up the workload they
used.