Re: Linux 2.6.29

From: Neil Brown
Date: Tue Mar 31 2009 - 05:59:03 EST


On Friday March 27, tytso@xxxxxxx wrote:
> On Fri, Mar 27, 2009 at 07:14:26PM +0000, Alan Cox wrote:
> > > Agreed, we need a middle ground. We need a transition path that
> > > recognizes that ext3 won't be the dominant filesystem for Linux in
> > > perpetuity, and that ext3's data=ordered semantics will someday no
> > > longer be a major factor in application design. fbarrier() semantics
> > > might be one approach; there may be others. It's something we need to
> > > figure out.
> >
> > Would making close imply fbarrier() rather than fsync() work for this ?
> > That would give people the ordering they want even if they are less
> > careful but wouldn't give the media error cases - which are less
> > interesting.
>
> The thought that I had was to create a new system call, fbarrier()
> which has the semantics that it will request the filesystem to make
> sure that (at least) changes that have been made data blocks to date
> should be forced out to disk when the next metadata operation is
> committed.

I'm curious about the exact semantics that you are suggesting.
Do you mean that
1/ any data block in any file will be forced out before any metadata
for any file? or
2/ any data block for 'this' file will be forced out before any
metadata for any file? or
3/ any data block for 'this' file will be forced out before any
metadata for this file?

I assume the contents of directories are metadata. If 3 is the case,
do we include the metadata of any directories known to contain this
file? Recursively?

I think that if we do introduce new semantics, they should be as weak
as possible while still achieving the goal, so that fs designers have
as much freedom as possible. They should also be as expressive as
possible so that we don't find we want to extend them later.

What would you think of:
    fcntl(fd, F_BEFORE, fd2)

with the semantics that it sets up a transaction dependency between fd
and fd2 and more particularly the operations requested through each
fd.

So if 'fd' is a file, and 'fd2' is the directory holding that file,
then
    fcntl(fd, F_BEFORE, fd2);
    write(fd, stuff);
    renameat(fd2, "file", fd2, "newname");

would ensure that the writes to the file were visible on storage
before the rename.
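
To make that concrete, here is a rough C sketch of the sequence. F_BEFORE
does not exist in any kernel; the value defined below is a placeholder, so
on a current system the fcntl() call is expected to fail (typically with
EINVAL) and the sketch falls back to fsync(), which is what careful
applications do today. The function name replace_file and its parameters
are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* HYPOTHETICAL: F_BEFORE is only a proposal. This placeholder value is
 * an assumption, chosen so that current kernels reject the command. */
#ifndef F_BEFORE
#define F_BEFORE 0x7fff0001
#endif

/* Write buf to dirfd/tmpname, then rename it to dirfd/newname, asking
 * that the data reach storage before the rename does.
 * Returns 0 on success, -1 on error. */
int replace_file(int dirfd, const char *tmpname, const char *newname,
                 const void *buf, size_t len)
{
    assert(tmpname && newname && buf);

    int fd = openat(dirfd, tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Proposed call: order operations on fd before those on dirfd. */
    int ordered = (fcntl(fd, F_BEFORE, dirfd) == 0);

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Fallback when the ordering request was refused: wait for the
     * data to be written out before touching the directory. */
    if (!ordered && fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    if (close(fd) < 0)
        return -1;

    return renameat(dirfd, tmpname, dirfd, newname);
}
```

A caller would open the directory, call something like
replace_file(dirfd, "file", "newname", data, len), and never need to know
whether the kernel honoured the ordering request or the fsync() path ran.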
You could also do
    fd1 = open("afile", O_RDWR);
    fd2 = open("afile", O_RDWR);
    fcntl(fd1, F_BEFORE, fd2);

then use write(fd1) to write journal updates to one part of the
(database) file, and write(fd2) to write in-place updates,
and it would just "do the right thing". (You might want to call
fcntl(fd2, F_BEFORE, fd1) as well ... I haven't quite thought through
the details of that yet).
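
For comparison, this is roughly what the journal-then-update pattern has to
look like without any ordering primitive: the journal write is forced out
with fdatasync() before the in-place write is issued. The file layout
(JOURNAL_OFF, DATA_OFF) and the function name journaled_update are
assumptions for illustration only. With something like F_BEFORE the hope
would be that the explicit wait in the middle could go away.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define JOURNAL_OFF 0     /* hypothetical layout: journal region first */
#define DATA_OFF    4096  /* hypothetical layout: in-place data after it */

/* jfd and dfd are two descriptors open on the same file.
 * Log the record through jfd, make it durable, then apply it in place
 * through dfd. Returns 0 on success, -1 on error. */
int journaled_update(int jfd, int dfd, const void *rec, size_t len)
{
    assert(len <= DATA_OFF - JOURNAL_OFF);

    if (pwrite(jfd, rec, len, JOURNAL_OFF) != (ssize_t)len)
        return -1;

    /* Today the only portable way to order the two writes is to wait
     * here until the journal record is on storage. */
    if (fdatasync(jfd) < 0)
        return -1;

    if (pwrite(dfd, rec, len, DATA_OFF) != (ssize_t)len)
        return -1;
    return 0;
}
```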

If you gave AT_FDCWD as the fd2 in the fcntl, then operations on fd1
would be ordered before any namespace operations which did not specify a
particular directory, which would be fairly close to option 2 above.

A minimal implementation could fsync fd1 before allowing any operation
on fd2. A more sophisticated implementation could record the
dependencies in internal data structures and start writeout of the fd1
changes without actually waiting for them to complete.
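
The minimal variant can be sketched in user space as a small dependency
table plus wrappers. This is only an emulation under stated assumptions:
every name here (before_register, before_write, before_barrier,
before_renameat, struct fd_dep) is invented for the sketch, and a real
implementation would live in the kernel and hook the operations itself
rather than rely on callers using wrappers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_DEPS 16

/* One recorded "first before then" dependency (hypothetical structure). */
struct fd_dep { int first; int then; int dirty; };

static struct fd_dep deps[MAX_DEPS];
static int n_deps;

/* Stand-in for fcntl(first, F_BEFORE, then). 'then' may be AT_FDCWD. */
int before_register(int first, int then)
{
    assert(first >= 0);
    if (n_deps == MAX_DEPS)
        return -1;
    deps[n_deps++] = (struct fd_dep){ first, then, 0 };
    return 0;
}

/* write() wrapper: remember that fd now has data not yet synced. */
ssize_t before_write(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n > 0)
        for (int i = 0; i < n_deps; i++)
            if (deps[i].first == fd)
                deps[i].dirty = 1;
    return n;
}

/* Call before any operation on 'then': fsync each dirty fd ordered
 * before it. Returns the number of fsync() calls made, or -1 on error. */
int before_barrier(int then)
{
    int synced = 0;
    for (int i = 0; i < n_deps; i++) {
        if (deps[i].then != then || !deps[i].dirty)
            continue;
        if (fsync(deps[i].first) < 0)
            return -1;
        deps[i].dirty = 0;
        synced++;
    }
    return synced;
}

/* renameat() within one directory, honouring dependencies on dirfd. */
int before_renameat(int dirfd, const char *oldname, const char *newname)
{
    if (before_barrier(dirfd) < 0)
        return -1;
    return renameat(dirfd, oldname, dirfd, newname);
}
```

The dirty flag is what keeps this from degenerating into an fsync on every
operation: if nothing has been written through fd1 since the last barrier,
the operation on fd2 proceeds without waiting.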


Just a thought....

NeilBrown