Re: Soft metadata updates paper w/code

Ingo Molnar (mingo@pc7537.hil.siemens.at)
Thu, 24 Jul 1997 10:37:05 +0200 (MET DST)


On Tue, 22 Jul 1997, Theodore Y. Ts'o wrote:

> I've looked at the tech report. It's pretty clever. The basic idea is
> that when you write out a block which has unsatisfied dependencies, you
> temporarily *undo* the changes, and then write the block out to disk,
> thus guaranteeing the on-disk copy is consistent.

as far as i've understood it, the 'undo concept' is just the mental model
behind the scheme. When doing it for real, they apply a change to a disk
block only when it can be applied.

so every metadata block has a separate, in-memory 'outstanding
modifications' structure, which is empty if the block is idle.
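
a rough sketch of how i picture such a structure, in plain user-space C.
All names here (struct pending_mod, struct meta_block, MAX_MOD_BYTES, the
1024-byte block) are made up for illustration; they come from neither the
paper nor the kernel:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES   1024   /* assumed metadata block size */
#define MAX_MOD_BYTES 32     /* assumed upper bound for a 'small' change */

/* one outstanding modification: "put these bytes at this offset" */
struct pending_mod {
    size_t offset;
    size_t len;
    unsigned char bytes[MAX_MOD_BYTES];
    struct pending_mod *next;
};

/* a metadata block plus its in-memory list of outstanding modifications */
struct meta_block {
    unsigned char data[BLOCK_BYTES];   /* the copy that goes to disk */
    struct pending_mod *pending;       /* NULL while the block is idle */
};

/* start with a zeroed block and nothing outstanding */
static void meta_block_init(struct meta_block *b)
{
    assert(b != NULL);
    memset(b->data, 0, sizeof b->data);
    b->pending = NULL;
}

/* attach one modification to the front of the block's list */
static void meta_block_add(struct meta_block *b, struct pending_mod *m)
{
    m->next = b->pending;
    b->pending = m;
}

/* 'idle' in the sense used above: nothing is outstanding */
static int meta_block_is_idle(const struct meta_block *b)
{
    return b->pending == NULL;
}
```

the point is only that the idle case costs one NULL pointer per block.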

> The report talks about undoing changes and then redoing them after the
> write succeeds, so I assume that during the duration of the write,
> access to that disk block is locked out. This could be a contention
> issue for heavily accessed directories (like /tmp) or block bitmaps.

it basically double-buffers changes. There is the 'main copy', which is
used for disk-IO, and there is the 'outstanding modifications graph',
which is in most cases a simple one-entry structure. Two outstanding
modifications can be 'coalesced', the later one superseding the earlier
one.
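
coalescing in the simple one-entry case could look roughly like this.
Again just a sketch: mod_record() and the slot layout are my invention,
and i assume two modifications coalesce only when they cover exactly the
same byte range (the paper may well be smarter about this):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MOD_BYTES 32     /* assumed upper bound for a 'small' change */

/* one-entry 'outstanding modification' slot */
struct pending_mod {
    int    valid;            /* 0 = slot is empty */
    size_t offset;
    size_t len;              /* must be <= MAX_MOD_BYTES */
    unsigned char bytes[MAX_MOD_BYTES];
};

/*
 * Record a new modification. If the slot already holds a modification
 * of the same byte range, the new one supersedes it in place.
 * Returns 1 if stored or coalesced, 0 if the slot is occupied by a
 * modification of a different range (that case would need a longer
 * list or a real graph, which this sketch does not have).
 */
static int mod_record(struct pending_mod *slot, size_t offset,
                      const void *bytes, size_t len)
{
    assert(len <= MAX_MOD_BYTES);
    if (slot->valid && (slot->offset != offset || slot->len != len))
        return 0;
    slot->valid = 1;
    slot->offset = offset;
    slot->len = len;
    memcpy(slot->bytes, bytes, len);
    return 1;
}
```

so two updates of the same field while the block is busy still cost only
one entry, and only the last value is ever applied.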

so the _only_ overhead in this method, as far as i've understood, is the
fact that in the contended case each modification is done 'twice': first
it's stored in the 'delayed modifications' structure, _then_, when the
block is idle, it's applied. [the paper talks about a syncer daemon that
applies outstanding modifications ... but i think Linux could use a bottom
half]
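
to make the 'twice' concrete, here is a small sketch that counts copies.
It is my simplification, not the paper's code: 'busy' is modelled as a
single io_in_flight flag, the delayed structure has one entry, and
block_write_done() stands in for whatever applies delayed changes (syncer
daemon or bottom half). All names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_BYTES   1024   /* assumed metadata block size */
#define MAX_MOD_BYTES 32     /* assumed upper bound for a 'small' change */

struct meta_block {
    unsigned char data[BLOCK_BYTES];  /* main copy, used for disk IO */
    int    io_in_flight;              /* nonzero while a write is running */
    int    has_delayed;               /* one-entry delayed modification */
    size_t d_offset, d_len;
    unsigned char d_bytes[MAX_MOD_BYTES];
    unsigned long copies;             /* how often bytes were copied */
};

/* modify the block: directly if idle, parked if a write is running */
static void block_modify(struct meta_block *b, size_t offset,
                         const void *bytes, size_t len)
{
    assert(len <= MAX_MOD_BYTES && offset + len <= BLOCK_BYTES);
    if (!b->io_in_flight) {
        memcpy(b->data + offset, bytes, len);   /* uncontended: one copy */
        b->copies++;
        return;
    }
    /* contended: first copy goes into the delayed slot (an older parked
     * entry is simply overwritten in this one-entry sketch) */
    memcpy(b->d_bytes, bytes, len);
    b->d_offset = offset;
    b->d_len = len;
    b->has_delayed = 1;
    b->copies++;
}

/* the write has finished: apply the parked change, the second copy */
static void block_write_done(struct meta_block *b)
{
    b->io_in_flight = 0;
    if (b->has_delayed) {
        memcpy(b->data + b->d_offset, b->d_bytes, b->d_len);
        b->has_delayed = 0;
        b->copies++;
    }
}
```

the main copy is never touched while the write is in flight, so the bytes
that reach the disk are the ones that were there when the write started.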

> The alternative approach would be to copy the block to scratch space and
> modify the scratch copy, and then let the device driver write that out
> to disk.

most modifications are 'small', and you can compress the modification
operation itself. This method leads to much less copying overhead: it
copies just the change itself, not the whole changed block.

> Either way, it requires pretty extensive changes and support all over
> the block device interface, and (perhaps) the generic filesystem layer.
> But the general approach is certainly worth keeping in mind. (Which is
> another way of saying I don't have time to rush out and implement it
> right now; maybe later.)

the block device interface does not have to know about this method at all;
it only has to kick a 'block has finished' handler, which it does already,
sort of. If done right, this interface could be built
filesystem-independent, although i guess the first implementation will be
ext2fs based?
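
what i mean by a filesystem-independent 'block has finished' hook, as a
user-space sketch. None of this is existing kernel API: struct io_block,
block_done_fn and block_layer_end_io() are invented names, and the block
layer side is reduced to a single function:

```c
#include <assert.h>
#include <stddef.h>

struct io_block;

/* hook a filesystem registers to hear that a write has finished */
typedef void (*block_done_fn)(struct io_block *blk);

struct io_block {
    unsigned long blocknr;
    block_done_fn on_done;   /* NULL if nobody is interested */
    void *fs_private;        /* opaque to the block layer */
};

/* stand-in for the block layer's end-of-IO path: it calls the hook
 * and knows nothing about metadata dependencies */
static void block_layer_end_io(struct io_block *blk)
{
    assert(blk != NULL);
    if (blk->on_done)
        blk->on_done(blk);
}

/* example filesystem-side handler: just counts finished writes; a
 * real one would apply the outstanding modifications here */
static void count_done(struct io_block *blk)
{
    unsigned long *counter = blk->fs_private;
    (*counter)++;
}
```

the dependency logic lives entirely behind on_done and fs_private, so
ext2fs or any other filesystem could plug in its own handler.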

but think about it ... Linux could do _theoretically safe_ MSDOS FS
support, whee.

but the hard work is to _understand_ the metadata dependencies actually
... ;)

-- mingo