Re: fsync on large files

Alexander Viro (viro@math.psu.edu)
Wed, 17 Feb 1999 08:06:07 -0500 (EST)


On Wed, 17 Feb 1999, Stephen C. Tweedie wrote:

> In that case there is a distinction between file data and metadata,
> which belongs to an inode, and tree metadata which belongs to the tree
> itself. The way to solve this in a completely general-purpose manner
> is to allow non-cyclic buffer dependencies: dirty blocks belonging to
> a file should obiously be linked on the inode's dirty list, but tree
> metadata should be linked on the leaf node buffers' dirty lists. If
> we sync a dirty buffer which has dependencies on parent buffers in the
> tree, then we need to sync those buffers too, and so on up to the root
> of the tree.

Another way is to consider the buffer cache as patch. I mean the following
thing: we have two states of filesystem - on-disk one and (on-disk +
in-core) one. Contents of buffer cache works as patch that should be
applied to the former to transform it into the later. Now, each operation
adds a new patch and currently buffer cache just keeps the union of
everything we submitted. If it would be real patches (i.e. bidirectional
ones) there would be no problems with metadata and data consistency. We
could pick any dirty buffer and look at the set of patches touching it. If
none of those patches depends on anything (except other patches from the
set) we could push the block on disk. If there would be dependencies on
other patches - well, make a copy of block, unroll patches with pending
dependencies on that copy and push the result to disk. When the driver
says that block has hit the disk we can forget about associated patches
(and copy if it had been created).
Indeed, in naive form this scheme would be horrible both wrt speed
and memory footprint. First of all, if patch doesn't depend on anything we
never have to unroll it. Thus for such patches we can simply apply them to
block and just keep the record about said patch. Then, some patches to the
same block can be immediately merged. Then, bdflush can pick the blocks
with no pending dependencies first. In terms of IO this scheme means that
some blocks will have to be written more than once. If you don't have
cyclic dependencies on block level (on *patch* level there will be none in
any case) this will give the same amount of IO as the current scheme (disk
as VM with software pager) does. In case of data blocks almost all patches
will be completely independent from everything else. This scheme was
implemented for metadata and from the benchmarks done by its authors it
gave 5% overhead compared to full async. I think that with few changes it
can be applied both to data and metadata at bearable cost. I am going to
try to implement it and see what kind of overhead will it give. If it will
work normally... <knocking the wood>

> Buffer dependency chains like this are really hard to deal with in the
> general case for filesystem write ordering, mainly because inode and
> directory bitmap blocks rapidly lead to circular dependencies.
> However, for the special case of syncing a specific branch of a tree
> structure to disk, it may well be sufficient.
>
> To maintain on-disk consistency of the _entire_ tree, not just for the
> branches leading to the file being synced, we really do need a full
> write dependency mechanism or a journaling mechanism. With
> journaling, the fsync problem becomes much more simple: all you need
> is a record of the most recent transaction which affected a particular
> inode, and fsync() involves no more than committing that transaction
> synchronously.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/