Re: Ordered block writes

Stephen D. Williams (sdw@lig.net)
Sun, 1 Feb 1998 12:59:18 -0500 (EST)


Maybe an efficient way to do this is to add block dependencies rather
than strict ordering. This would allow the OS to take advantage of
out-of-order scatter-gather writes but still maintain the important
relationships.

In other words, we add dirty pages to the main dirty list for
sys_sync() etc. to write, but keep pages with unmet dependencies on a
secondary list. The method of update could be debated, but could be as
simple as waiting until the main list is purged (which might have a
problem with new additions) or simple reference counter updates from
the main list back to the entry in the secondary list. Of course this
must support an arbitrary number of buckets to be general.

When the dependency count of something in an N-ary list goes to zero,
it gets promoted to the main list.
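
A minimal user-space sketch of the reference-counter variant, just to
make the promotion rule concrete. Everything here (struct blk,
blk_mark_dirty, blk_add_dep, blk_write_done, MAX_WAITERS) is a
hypothetical name made up for illustration, not an existing kernel
interface, and the two lists are reduced to a single flag per block:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8          /* arbitrary limit, for the sketch only */

struct blk {
    int pending;               /* unwritten blocks this one waits for */
    int on_main_list;          /* 1 = eligible for write-out by sync  */
    int written;               /* 1 = already on disk                 */
    struct blk *waiters[MAX_WAITERS]; /* blocks waiting on this one   */
    size_t nwaiters;
};

/* A dirty block goes on the main list only if nothing holds it back. */
static void blk_mark_dirty(struct blk *b)
{
    b->written = 0;
    b->on_main_list = (b->pending == 0);
}

/* Record that `later` must not reach disk before `first`.
 * Returns 0 on success, -1 if `first` has no room for more waiters. */
static int blk_add_dep(struct blk *first, struct blk *later)
{
    if (first->written)
        return 0;              /* already on disk, nothing to wait for */
    if (first->nwaiters == MAX_WAITERS)
        return -1;
    first->waiters[first->nwaiters++] = later;
    later->pending++;
    later->on_main_list = 0;   /* demote to the secondary list */
    return 0;
}

/* Called when the write of `b` completes: drop one reference on every
 * waiter and promote those whose count reaches zero. */
static void blk_write_done(struct blk *b)
{
    assert(b->pending == 0);   /* it should never have been written early */
    b->written = 1;
    b->on_main_list = 0;
    for (size_t i = 0; i < b->nwaiters; i++) {
        struct blk *w = b->waiters[i];
        if (--w->pending == 0 && !w->written)
            w->on_main_list = 1;
    }
    b->nwaiters = 0;
}
```

With a data block, an inode block that depends on it, and a directory
block that depends on the inode, only the data block starts out on the
main list; each completed write promotes the next one in the chain.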

This has the advantage of unordered writes when possible, smaller
changes to the sync logic, and generality. The file system driver
would have to manage the dependencies by tagging data writes with an
ID that would be used later by dependent blocks.

I like versioning systems (such as database versioning of tuples) and
I think of these IDs as 'generations'. Each commit/'sync' (at the
file system level) generates a new generation ID, after finishing off
the old one.
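
One way the generation idea might look, as a rough sketch only. The
names (struct fs_gen, gen_tag_write, gen_commit and so on) are
invented for this example, and it assumes a single open generation at
a time, which is a simplification on my part:

```c
#include <assert.h>

typedef unsigned long gen_t;

struct fs_gen {
    gen_t current;             /* generation new writes are tagged with */
    gen_t committed;           /* newest generation fully on disk       */
    unsigned long outstanding; /* writes of `current` not yet on disk   */
};

static void gen_init(struct fs_gen *g)
{
    g->current = 1;
    g->committed = 0;          /* nothing committed yet */
    g->outstanding = 0;
}

/* Tag a data write; the returned ID is what a dependent block keeps. */
static gen_t gen_tag_write(struct fs_gen *g)
{
    g->outstanding++;
    return g->current;
}

/* A tagged write has reached the disk. */
static void gen_write_done(struct fs_gen *g)
{
    assert(g->outstanding > 0);
    g->outstanding--;
}

/* Finish off the old generation, then open a new one.
 * Returns 0 on success, -1 if writes of this generation are pending. */
static int gen_commit(struct fs_gen *g)
{
    if (g->outstanding != 0)
        return -1;
    g->committed = g->current;
    g->current++;
    return 0;
}

/* A dependent block may be written once its generation is committed. */
static int gen_dep_satisfied(const struct fs_gen *g, gen_t dep)
{
    return dep <= g->committed;
}
```

A dependent block stores the ID returned by gen_tag_write() and stays
off the main list until gen_dep_satisfied() is true for that ID.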

I am definitely interested in a transaction file system. We need an
absolutely safe file system that fsck's instantly and has easy
mirroring/logging. Enhancing performance with an optimized logical to
physical mapping would be great too. Win98 is using my very old idea
of recording those blocks used during boot and typical application
startup and 'defragmenting' them into the order that they will
typically be read. Linux needs this. When I start Emacs, for
instance, there is a knowable sequence of blocks to be read from
various files that could be organized sequentially automatically.
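
As a toy illustration of the reordering step (not how Win98 or any
real defragmenter does it, as far as I know): given a recorded trace
of block reads, assign each block a new sequential slot in order of
first access. plan_layout and NBLOCKS are hypothetical names, and the
"disk" is just a small array:

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS 16             /* size of the toy "disk" */

/* Fill new_pos[] so that new_pos[b] is the slot block b should move
 * to, or -1 if the trace never touched b. Slots are handed out in
 * order of first access, so replaying the trace reads them
 * sequentially. Returns the number of blocks that were placed. */
static size_t plan_layout(const unsigned *trace, size_t ntrace,
                          int new_pos[NBLOCKS])
{
    size_t next = 0;

    assert(trace != NULL || ntrace == 0);
    for (size_t b = 0; b < NBLOCKS; b++)
        new_pos[b] = -1;
    for (size_t i = 0; i < ntrace; i++) {
        unsigned b = trace[i];
        if (b >= NBLOCKS)
            continue;          /* ignore out-of-range entries */
        if (new_pos[b] == -1)  /* first time we see this block */
            new_pos[b] = (int)next++;
    }
    return next;
}
```

For the trace 9, 3, 9, 12, 3, 0 this places block 9 in slot 0, block 3
in slot 1, block 12 in slot 2 and block 0 in slot 3; repeated reads do
not get a second slot.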

On top of all of this, and possibly in a different project, I want
POSIX.4 prioritized async I/O supported in the filesystem. This is a
requirement of portable high-speed database packages. (Sybase and
Oracle use vendor-specific hacks to do the same thing, usually with
raw partitions.) This also points to making the file system tunable,
so that extremely large files (taking all or most of a partition) can
be handled without things like triple-indirect block lookups throwing
the heads around needlessly.

sdw