Re: ext3-0.0.2e released

From: Daniel Phillips (phillips@innominate.de)
Date: Tue Jul 18 2000 - 10:10:18 EST

Next message: Brian Gerst: "Re: Direct control of a device??"
Previous message: Ibrahim Haddad (LMC): "NFS3"
Next in thread: Bill Huey: "Re: ext3-0.0.2e released"
Reply: Bill Huey: "Re: ext3-0.0.2e released"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 06 Jul 2000, Steve Whitehouse wrote:
> can you explain "phase tree" and/or give a reference ?

I can describe it and give a reference. It's the algorithm used in my Tux2
filesystem that I've been working on since around Christmas, or longer than that
if you count 10 years of thinking about doing it :-)

Phase tree is my name for an algorithm similar to one that has been used in WAFL
and in a OS named Auragen that you can ask Victor Yodaiken about. I
developed the algorithm independently and it was only on reading a posting from
Victor on linux-kernel in from June, 1997, that I realized I wasn't the only
one to have thought of it.

My phase tree algorithm is different enough from the other two that I think it's
fair to call it a new algorithm, or at least a close cousin. I'm writing a
white paper on it for presentation at the ALS this fall. An abstract is
available now. I have a working prototype of Tux2 "with some issues" that I'm
now busily porting form 2.2.12 to 2.4.0.test.

I have attached Victor's original email, which makes very good reading. You can
get the Tux2 abstract by emailing me... (I'm very interested in finding out
exactly who is interested.)

Here is a brief description:

Tux2 is based on Ext2. It is not a journalling filesystem, but it does what a
journalling filesystem does: keep your files safe in the event of a processing
interruption. It does that for both data and metadata and, according to my
early benchmarks, should do it at about the same speed at which a JFS does
metadata-only. We shall see.

Tux2 uses my "phase tree" algorithm (so christened by Alan Cox - I called it
tree/phase but I like his name more). Phase tree imposes a partial ordering on
disk writes to ensure that a filesystem on disk is always updated atomically,
with a single write of the filesystem metaroot. To work properly, the entire
filesystem including all metadata, must be structured as a tree. Ext2 is not
structured as a tree, therefore, the major difference between Ext2 and Tux2 is
that all metadata has been rearranged into a tree.

Once you have the filesystem in the form of a tree you can make a copy of
the metaroot, then for all updates, apply a "branching" algorithm that works
from the updated block towards the metaroot doing a copy-on-write at each node
that needs updating. After some number of updates (the exact number is a
performance-tuning parameter) you store the new metaroot on disk, which gives
you an atomic update. So far this is similar to Auragen and WAFL.

Tux2's phase tree algorithm works almost entirely in cache and is intimately
coupled to the buffer cache system. A third metatree is added, to allow
filesystem updating to continue without pause while the second tree is commited
to disk, eventually replacing the first metatree using the abovementioned
atomic write.

In tree phase terminology, the three trees are called "phases". The three
phases are:
  - recorded phase (the consistent filesystem image currently on disk)
  - recording phase (diffs for a new consistent image currently being written)
  - branching phase (the changing filesystem as applications see it)

Tux2 has its own update daemon that handles its "phase transitions". A phase
transition is the act of commiting a new metaroot wherein the second phase tree
becomes the first, the third becomes the second and a new metaroot is created.
(This is a function analgous to kflushd, though kflushd in its current form
can't possibly know what it would need to know to cause phase transistions at
appropriate times, and in any event, it has no way to initiate one.)

That's basically it. There are some other wrinkles in Tux2 that serve to
flatten the filesystem tree, reduce the number of block writes required and
keep cpu usage to a reasonable level.

As Alan mentioned, there are many interesting things you can do when you have a
filesystem's metadata in the form of a tree. Tux2 doesn't do most of those
things at this point, since its main purpose in life is to demonstrate the
efficacy of the phase tree algorithm and to allow me to do kernel development
without putting my precious files at risk every time I need to reset the system.

Please forgive me for my delay in responding - I've been busy settling into a
new job (which should be easily guessable from my email address).

> >
> > > You can do the same today on Linux with LVM snapshots. They are only
> > > useful for read-only consistency checking because they are read-only.
> > > (so in case of a problem you'll need to umount and rerun fsck on the
> > > normal device)
> >
> > Also there is ext2 based work going on using phase tree rather than journalling
> > which gives you similar journal properties, in future snapshots and also
> > very nice handling of multipath error recovery.
> >
> > Alan

-- Daniel

text/html attachment: safe.commit.html

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/

Next message: Brian Gerst: "Re: Direct control of a device??"
Previous message: Ibrahim Haddad (LMC): "NFS3"
Next in thread: Bill Huey: "Re: ext3-0.0.2e released"
Reply: Bill Huey: "Re: ext3-0.0.2e released"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Sun Jul 23 2000 - 21:00:11 EST