Re: 2.2.0 wishlist

Stephen C. Tweedie (sct@dcs.ed.ac.uk)
Mon, 17 Jun 1996 09:28:47 +0100


Hi,

On Mon, 17 Jun 1996 03:10:21 +0200 (MET DST), Marek Michalkiewicz
<marekm@i17linuxb.ists.pwr.wroc.pl> said:

> Stephen C. Tweedie:
>> Ordered writes have major problems. You want a list? OK: you get bad
>> cyclic dependencies (especially between inode blocks and bitmaps,
>> since multiple inodes get stored in a single block); if your data is

> I think (but could be wrong) that bitmaps can be recovered by e2fsck
> from the remaining information, so bitmaps could still be completely
> asynchronous as long as any other writes are ordered.

It depends on the semantics you adopt. Normally, to get decent
semantics from ordered writes, you assume that where the filesystem
stores the same information twice, as with the bitmaps that record
the block allocations held in the inodes, one copy or the other is
definitive. These systems commonly treat the bitmaps as definitive.

The reason is that although we want to submit changes to an inode to
the journal rapidly, we don't necessarily want to journal the file
data content itself, because that's too slow. So we write out the
file data asynchronously, but only commit that data once it has
reached the disk. We do this by not marking the blocks of data as being used on
the disk until after the data has been written, so we never end up
with old data appearing randomly in a file after a crash; on recovery,
if the bitmap says that some block is unused, we assume that although
it did get allocated to a file, the data never actually got written
before the crash.
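
To make the ordering concrete, here is a rough sketch of the scheme
described above as a toy in-memory model. It is only an illustration
under stated assumptions: ToyDisk, append_block, recover, read_file and
the crash_after parameter are all hypothetical names invented for this
example, not anything from the kernel or from a real filesystem.

```python
# Toy model of "bitmap is definitive" ordered writes.
# All names here are hypothetical; this is not kernel code.

class ToyDisk:
    def __init__(self, nblocks):
        self.bitmap = [False] * nblocks      # on-disk allocation bitmap
        self.blocks = [b"stale"] * nblocks   # on-disk contents (old data)
        self.inode_blocks = []               # on-disk inode block pointers


def append_block(disk, blockno, data, crash_after=None):
    """Append one block to the file, optionally 'crashing' part-way."""
    # 1. The inode change reaches the disk quickly.
    disk.inode_blocks.append(blockno)
    if crash_after == "inode":
        return
    # 2. The file data is written out later, asynchronously.
    disk.blocks[blockno] = data
    if crash_after == "data":
        return
    # 3. Only now is the block marked as used, which commits the data.
    disk.bitmap[blockno] = True


def recover(disk):
    """Treat the bitmap as definitive: drop blocks it says are unused."""
    disk.inode_blocks = [b for b in disk.inode_blocks if disk.bitmap[b]]


def read_file(disk):
    return [disk.blocks[b] for b in disk.inode_blocks]


disk = ToyDisk(4)
append_block(disk, 0, b"new-0")                       # fully committed
append_block(disk, 1, b"new-1", crash_after="inode")  # crash before data
recover(disk)
print(read_file(disk))
```

After recovery the file holds only the committed block, so the stale
contents of block 1 never show up in it.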

>> always being written, you NEVER get to update the inode; and Unixware

> Hmm, this looks like a real problem...

It is. I heard an amusing story about a large database which was
protected by this form of write ordering. When it crashed after
several weeks' use, they discovered that the database had never been
idle, so the data on disk had never been _fully_ up to date with the
data in memory, and ordered writes kept telling the kernel that it
was never quite time yet to update the inode on disk... and so on
reboot they were given back a perfectly consistent, empty, database
file. Guaranteed no bad data in _that_ file.
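
The starvation is easy to reproduce in a toy model. The sketch below
assumes a deliberately simplified rule, "the on-disk inode may only be
updated when no dirty data is outstanding", and a writer that dirties
one block per step; simulate and idle_steps are hypothetical names made
up for the illustration, and real implementations are more subtle.

```python
# Toy simulation of inode-update starvation under ordered writes.
# Assumed rule: the on-disk inode is only written when nothing is dirty.

def simulate(steps, idle_steps=()):
    mem_size = 0    # file size according to the in-memory inode
    disk_size = 0   # file size according to the on-disk inode
    for step in range(steps):
        dirty = 0   # writeback has flushed everything dirtied earlier
        if step not in idle_steps:
            mem_size += 1   # the application appends another block
            dirty += 1      # ...which is not on disk yet
        if dirty == 0:
            # Data on disk has caught up, so the inode may be updated.
            disk_size = mem_size
    return mem_size, disk_size


print(simulate(1000))                    # never idle
print(simulate(1000, idle_steps={500}))  # one idle step half-way
```

With no idle step the on-disk size stays at zero however long the run
is; a single idle step is enough to let the inode catch up to that
point.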

>> have got a broad-ranging patent for ordered filesystem writes anyway.
>> <sigh>.

> Maybe it's time for:

> Exclude options which may not be used in the US? (CONFIG_USA) [N/y/?]

Yup. It's not just filesystems, either; US software legislation (or
rather, the US's interpretation of its own legislation...) is going
to be an increasingly difficult problem for US Linux users. It's not
going to hold the rest of the world back.

Cheers,
Stephen.

--
Stephen Tweedie <sct@dcs.ed.ac.uk>
Department of Computer Science, Edinburgh University, Scotland.