Re: Linux 2.6.29

From: Kyle Moffett
Date: Tue Mar 24 2009 - 14:49:22 EST


On Tue, Mar 24, 2009 at 1:55 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, 24 Mar 2009, Theodore Tso wrote:
>> Try ext4, I think you'll like it. :-)
>>
>> Failing that, data=writeback for single-user machines is probably your
>> best bet.
>
> Isn't that the same fix? ext4 just defaults to the crappy "writeback"
> behavior, which is insane.
>
> Sure, it makes things _much_ smoother, since now the actual data is no
> longer in the critical path for any journal writes, but anybody who thinks
> that's a solution is just incompetent.
>
> We might as well go back to ext2 then. If your data gets written out long
> after the metadata hit the disk, you are going to hit all kinds of bad
> issues if the machine ever goes down.

Not really...

Regardless of any journalling, a power-fail or a crash is almost
certainly going to cause "data loss" of some variety. We simply
didn't get to sync everything we needed to (otherwise we'd all be
shutting down our computers with the SCRAM switches just for kicks).
The difference is, with ext3/4 (in any journal mode) we guarantee our
metadata is consistent. This means that we won't double-allocate or
leak inodes or blocks, which means that we can safely *write* to the
filesystem as soon as we replay the journal. With ext2 you *CAN'T* do
that at all, as somebody may have allocated an inode but not yet
marked it as in use. The only way to safely figure all that out
without journalling is an fsck run.

The difference between ext4 and ext3-in-writeback-mode is this: if
you get a crash in the narrow window *after* writing the initial
metadata and before writing the data, ext4 will give you a zero-length
file, whereas ext3-in-writeback-mode will give you a proper-length
file filled with whatever used to be on disk (which might be the
contents of a previous /etc/shadow, or maybe somebody's finance files).

In that same situation, ext3 in data-ordered or data-journal mode will
"close" the window by preventing anybody else from making forward
progress until the data and the metadata are both updated. The thing
is, even on ext3 I can get exactly the same kind of behavior with an
appropriately timed "kill -STOP $dumb_program", followed by a power
failure 60 seconds later. It's a relatively obvious race condition...

When you create a file, you can't guarantee that all of that file's
data and metadata has hit disk until after an fsync() call returns.
The only *possible* exceptions are in cases like the
previously-mentioned (and now patched)
open(A)+write(A)+close(A)+rename(A,B), where the
rename-over-existing-file should act as an implicit filesystem
barrier. It should ensure that all writes to the file get flushed
before it is renamed on top of an existing file, simply because so
much UNIX software expects it to act that way.

When you're dealing with programs that simply
open()+ftruncate()+write()+close(), however... there's always going to
be a window in-between the ftruncate and the write where the file *is*
an empty file, and in that case no amount of operating-system-level
cleverness can deal with application-level bugs.

Cheers,
Kyle Moffett
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/