Re: Linux 2.6.29

From: Theodore Tso
Date: Fri Mar 27 2009 - 15:33:18 EST


On Fri, Mar 27, 2009 at 07:14:26PM +0000, Alan Cox wrote:
> > Agreed, we need a middle ground. We need a transition path that
> > recognizes that ext3 won't be the dominant filesystem for Linux in
> > perpetuity, and that ext3's data=ordered semantics will someday no
> > longer be a major factor in application design. fbarrier() semantics
> > might be one approach; there may be others. It's something we need to
> > figure out.
>
> Would making close imply fbarrier() rather than fsync() work for this ?
> That would give people the ordering they want even if they are less
> careful but wouldn't give the media error cases - which are less
> interesting.

The thought that I had was to create a new system call, fbarrier(),
which would request that the filesystem make sure that (at least) the
changes that have been made to data blocks so far are forced out to
disk when the next metadata operation is committed. For ext3 in
data=ordered mode, this would be a no-op. For filesystems that have
fast/efficient fsync()'s, it could simply be an fsync(). For other
filesystems, it could trigger an asynchronous writeout, so long as the
journal commit waits for that writeout to complete. For yet other
filesystems, it might set a flag that causes the filesystem to start a
synchronous writeout of the file as part of the commit operation. The
bottom line is that we could *then* tell application programmers to do
open/write/fbarrier/close/rename. (And on operating systems that
don't have fbarrier, they can use autoconf magic to replace fbarrier
with fsync.)
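Roughly, the application side would look something like the sketch
below. This is only an illustration of the proposed pattern: fbarrier()
does not exist today, and the HAVE_FBARRIER guard stands in for the
autoconf check mentioned above; everything else is ordinary POSIX.

	/*
	 * Sketch of the proposed atomic-update pattern: write a temp
	 * file, issue the (hypothetical) fbarrier(), then rename over
	 * the target.  Without HAVE_FBARRIER it falls back to fsync().
	 */
	#include <fcntl.h>
	#include <unistd.h>

	static int write_file_atomically(const char *path, const char *tmp,
					 const void *buf, size_t len)
	{
		int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);
		if (fd < 0)
			return -1;

		if (write(fd, buf, len) != (ssize_t) len) {
			close(fd);
			unlink(tmp);
			return -1;
		}

	#ifdef HAVE_FBARRIER
		fbarrier(fd);	/* order data before the rename commits */
	#else
		fsync(fd);	/* portable fallback: force data out now */
	#endif
		if (close(fd) < 0) {
			unlink(tmp);
			return -1;
		}

		return rename(tmp, path);  /* atomically replace old file */
	}

The point of the pattern is that the data reaches disk no later than
the metadata operation (the rename) that makes it visible, without
requiring a synchronous flush on filesystems that can order the two
more cheaply.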

We could potentially make close() imply fbarrier(), but there are
plenty of times when that might not be such a great idea. If we do
that, we're back to requiring synchronous data writes for all files on
close(), which could lead to huge latencies, just as ext3's
data=ordered mode did. And in many cases, where the files in question
can be easily regenerated (such as object files in a kernel tree
build), there really is no reason to force the blocks to disk on
close(). In the highly unusual case where we crash in the middle of a
kernel build, we can just do a "make clean; make" and regenerate the
object files.

The fundamental idea here is that not all files need to be forced to
disk on close. Not all files need fsync(), or even fbarrier(). We can
make the system go much more quickly if we distinguish between files
that are precious and files that can be cheaply regenerated. It can
also make SSD drives last longer if we don't force blocks to disk for
non-precious files. If people disagree with this premise, we can go
back to something very much like ext3's data=ordered mode; but then we
get *all* of the problems of ext3's data=ordered mode, including the
unexpected filesystem latencies that Linus and Ingo have been
complaining about so much. The two are very much related.

Anyway, this is just one idea; I'm not claiming that fbarrier() is the
perfect solution --- but it is one I plan to propose at the upcoming
Linux Storage and Filesystem workshop in San Francisco in a week or
so. Maybe someone else will have a better idea.

- Ted

