Re: fsync on large files

Stephen C. Tweedie (sct@redhat.com)
Wed, 17 Feb 1999 10:57:02 GMT


Hi,

On Mon, 15 Feb 1999 22:03:29 -0500 (EST), "Theodore Y. Ts'o"
<tytso@MIT.EDU> said:

> From: Linus Torvalds <torvalds@transmeta.com>

> I'd much rather just always add it to the "inode dirty list". ...

> This works for ext2, but it doesn't work with filesystems (most notably
> FAT filesystems, and possibly some B-tree based filesystems) where
> metadata might be shared by multiple files, so a block might have to be
> on multiple inode dirty lists.

In that case there is a distinction between file data and metadata,
which belongs to an inode, and tree metadata which belongs to the tree
itself. The way to solve this in a completely general-purpose manner
is to allow non-cyclic buffer dependencies: dirty blocks belonging to
a file should obiously be linked on the inode's dirty list, but tree
metadata should be linked on the leaf node buffers' dirty lists. If
we sync a dirty buffer which has dependencies on parent buffers in the
tree, then we need to sync those buffers too, and so on up to the root
of the tree.

Buffer dependency chains like this are really hard to deal with in the
general case for filesystem write ordering, mainly because inode and
directory bitmap blocks rapidly lead to circular dependencies.
However, for the special case of syncing a specific branch of a tree
structure to disk, it may well be sufficient.

To maintain on-disk consistency of the _entire_ tree, not just for the
branches leading to the file being synced, we really do need a full
write dependency mechanism or a journaling mechanism. With
journaling, the fsync problem becomes much more simple: all you need
is a record of the most recent transaction which affected a particular
inode, and fsync() involves no more than committing that transaction
synchronously.

On Mon, 15 Feb 1999 19:36:30 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> On Mon, 15 Feb 1999, Theodore Y. Ts'o wrote:
>>
>> This works for ext2, but it doesn't work with filesystems (most notably
>> FAT filesystems, and possibly some B-tree based filesystems) where
>> metadata might be shared by multiple files, so a block might have to be
>> on multiple inode dirty lists.

> Note that even ext2 has this kind of dirty blocks: the inode blocks
> themselves.

That's really not a problem: the inode cache is a completely separate
cache anyway, and in fsync() we already flush the struct inode to disk
if it is marked dirty.

The other shared metadata for a file, such as bitmap blocks, can
easily be recovered via fsck so don't matter for this purpose.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/