Re: ext3-0.0.2e released

From: Daniel Phillips (phillips@innominate.de)
Date: Wed Jul 19 2000 - 03:53:52 EST


On Tue, 18 Jul 2000, Bill Huey wrote:
> > Tux2's phase tree algorithm works almost entirely in cache and is intimately
> > coupled to the buffer cache system. A third metatree is added, to allow
> > filesystem updating to continue without pause while the second tree is commited
> > to disk, eventually replacing the first metatree using the abovementioned
> > atomic write.
>
> This sounds very very close to FFS+softupdates on FreeBSD boxes where there
> is DAG representation of write-orderings and dependencies of various metadata
> chunks. The orderings can be delayed to reduced the number of explict disk writes,
> etc... just like FSes mounted as async and has similar performance to it and
> journaled FSes too. But it maintains FS consistency at all times.

It is not enough merely to control write ordering. You also have to control
the *locations* at which updated blocks are written, to avoid overwriting data
belonging to a previously recorded consistent filesystem state. To
illustrate, suppose you allocate a block, mark it as allocated in the block
bitmap, then you enter its disk address into an inode. Which block do you
write first, the inode block or the bitmap block? If you write the bitmap
block first then crash, you have lost track of a block. If you write the inode
block first then crash, your filesystem has a block marked as free being used
as part of a file. While the first case is clearly better than the second,
neither leaves the filesystem in a state where it does not have to be fscked.

It's possible that Tux2's commit latency could be improved through the use of
a priority graph as used in soft updates. For example, see the discussion of
transaction serving below; using a DAG you could report transaction completion
much sooner than phase tree's whole-tree commit allows. However, I see this as
further work. There is little doubt that the result would be more complex;
throughput would probably not be improved.

> It's very close to what I expect from a journaled FS.

Yes, that's the point. It is supposed to do what a journaled FS does, i.e.,
keep your files safe and make fsck go away, but with less overhead. There are
other advantages over journalling: there are *far* fewer boundary conditions to
deal with. Basically, there is only one ordering constraint to worry about per
phase: write one entire batch of updates before writing the next. Within a
phase the order of writing is completely unconstrained, so an elevator
algorithm is free to choose the shortest path across the disk surface. This
decouples the filesystem from the lowlevel I/O in a very satisfying way.

Note also that most journalling filesystems do not attempt to preserve the
integrity of data within files rigorously because of the associated overhead of
writing every data block twice (roughly speaking). Tux2 does provide an
integrity guarantee for *both* data and metadata.

To be fair, there are two things a journal can do that phase tree cannot: (1)
roll forward and (2) preserve filesystem integrity right up to the last
completed disk write. It's not really clear to me why (1) is useful. But (2)
is important for something like a network transaction server that wants to
report each transaction "complete and safe" absolutely as soon as possible. So
an agressive transaction server would report completion as soon as the journal
entry had been made, allowing the client application to stop waiting and go on
about its business. Phase tree has to wait until the upcoming metaroot write
completes before it can report any transaction complete; this will be any time
from a few tens to a few thousands of disk operations later, depending on how
the phase change heuristics are designed and configured.

This means that journalling is better than phase tree for transaction serving.
In most other applications, IMHO, phase tree will offer higher throughput while
still giving an acceptable transaction latency. I think that is good.

--
Daniel

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Jul 23 2000 - 21:00:12 EST