Re: scary ext2 filesystem question

Dominic Giampaolo (dbg@be.com)
Tue, 05 Jan 1999 20:53:23 PST


On December 26th, Alan Cox wrote:

>> I was reading a book on filesystem design, and came across the
>> following scary quote:
>
>Throw the book away, its author is incompetent to make that statement. You
>must therefore assume anything else the author wrote is probably incorrect
>and unsuitable for learning from.
>
As the author of the book in question, I think Alan Cox is wrong about
this, as well as in his follow-on assertions, and my intent here is to
clarify those issues. I have discussed this with Alan in private e-mail,
and since he is unwilling to retract his statements, I will explain why
I think he is wrong so that everyone can reach their own conclusions.

>> ensure that the file system is consistent, the lack of ordering on
>> operations can lead to confused applications or, even worse, crashing
>> applications because of the inconsistencies in the order of
>> modifications to the file system."
>
>And he fails to understand that writing metadata first is provably the same
>problem
>
Unfortunately, whether you write metadata first or not is irrelevant to
the example I gave in the book. This is what I wrote to Alan:

---------------------------------------------------------------------------

The original statement I made was that unexpected results can occur
if there is a crash. This is due to the way in which Linux ext2
handles file system metadata. That statement was based on two
assumptions that I believe to be correct:

- The Linux ext2 file system caches everything, including file
system metadata.

- The Linux block cache does not implement any soft-update
mechanism and flushes blocks as it sees fit (presumably
sorted by block address).

If those two conditions are true, then it is possible (although
unlikely) that an application could create two files, fileA and
fileB, and even though fileB is created after fileA, only fileB
will exist after a reboot. This can happen if the cache flushes
the metadata blocks related to fileB before the metadata blocks
for fileA. If a power failure occurs after the metadata for fileB
is flushed but before the metadata for fileA is, then after a reboot
fsck would properly clean up the file system and fileB would exist,
but not fileA. In some circumstances this is not acceptable.

Also, it is not sufficient for metadata to be written before file
data if there is no ordering within the metadata writes themselves.
That is, if there is no temporal ordering that forces fileA's
metadata to be written before fileB's, then the problem I describe
can happen. If there is a temporal ordering on file system metadata,
then you've pretty much got soft updates. At the time I wrote that
chapter (about 18 months ago) I do not believe any such feature
existed (nor am I currently aware of one).

If you can explain how this cannot happen, I will gladly retract
the statement and have the book corrected for its next printing.
Otherwise I would appreciate it if you would post a retraction of
your comments to the linux-kernel mailing list. Your comments were
rather harsh and uncalled for (especially given that you haven't
read the book).

In any case, just so it's clear, my intent was not to trash ext2.
I was merely pointing out an implication of the policy chosen. The
scenario I described is hardly a disaster; it's just something that
people should be aware of. Most applications would be unlikely to
run into such a scenario, but for some it is an issue.

---------------------------------------------------------------------------
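
To make the scenario described in that message concrete, here is a minimal
sketch (the helper name and paths are mine, not anything from the book or
from ext2 itself) of how an application that cares about the ordering of
file creations can force it on a POSIX system by fsync()ing both the new
file and the directory that contains it:

    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Create a file and push both its contents and its directory entry to
       disk before returning.  The fsync() on the containing directory is
       what forces the new directory entry (the metadata) out; without it,
       the cache may flush the entries for fileA and fileB in any order. */
    static int create_durably(const char *dir, const char *path)
    {
        int fd, dfd, rc;

        fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        rc = fsync(fd);                 /* flush the file itself */
        close(fd);
        if (rc < 0)
            return -1;

        dfd = open(dir, O_RDONLY);      /* now flush the directory entry */
        if (dfd < 0)
            return -1;
        rc = fsync(dfd);
        close(dfd);
        return rc;
    }

    int main(void)
    {
        /* fileA is known to be on disk before fileB is even created, so a
           crash can no longer leave fileB behind without fileA. */
        if (create_durably(".", "./fileA") < 0)
            return 1;
        if (create_durably(".", "./fileB") < 0)
            return 1;
        return 0;
    }

Of course, paying a disk flush per file is exactly the cost that a fully
cached file system is trying to avoid, which is why this is a decision for
the application rather than something the file system can make for it.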

To be fair, my book does imply that a synchronous-write update policy
is somehow "safer" than full caching. That is a debatable point given
some of the evidence presented here, and I will make sure to update it
in the next printing.

Next, there seems to be some confusion about the contents of files after
a crash. The contents of a file after a crash are almost certainly going
to be undefined, regardless of whether the file system uses journaling,
soft updates, or nothing at all. That is simply the way a disk cache works.
If an application needs to depend on the contents of a file across crashes
or power failures, then it needs to implement its own consistency checks or
journaling. Simply using the file size and assuming the data is valid is
hardly wise. I make this point quite strongly in the book.
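
As an illustration of the kind of application-level check I have in mind
(the record layout and checksum here are purely my own example, not
something the book or any file system prescribes), a program that cannot
tolerate half-written data might prefix each record with its length and a
checksum, and refuse to trust anything that does not verify after a
reboot:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Each record carries its own length and checksum, so a reader can
       tell a complete record from one that was only partially flushed to
       disk before a crash. */
    struct record_header {
        uint32_t length;
        uint32_t sum;
    };

    /* A trivial checksum for illustration; a real application would
       probably use CRC32 or something stronger. */
    static uint32_t checksum(const unsigned char *buf, uint32_t len)
    {
        uint32_t sum = 0;
        while (len--)
            sum = (sum << 1) ^ *buf++;
        return sum;
    }

    /* Returns 1 if the record at the start of buf is intact, 0 if it is
       truncated or corrupt and should be ignored or rolled back. */
    static int record_is_valid(const unsigned char *buf, size_t avail)
    {
        struct record_header hdr;

        if (avail < sizeof(hdr))
            return 0;
        memcpy(&hdr, buf, sizeof(hdr));
        if (avail - sizeof(hdr) < hdr.length)
            return 0;
        return checksum(buf + sizeof(hdr), hdr.length) == hdr.sum;
    }

    int main(void)
    {
        const char payload[] = "application data";
        unsigned char buf[sizeof(struct record_header) + sizeof(payload)];
        struct record_header hdr;

        hdr.length = sizeof(payload);
        hdr.sum = checksum((const unsigned char *)payload, hdr.length);
        memcpy(buf, &hdr, sizeof(hdr));
        memcpy(buf + sizeof(hdr), payload, sizeof(payload));

        printf("complete record valid: %d\n",
               record_is_valid(buf, sizeof(buf)));
        /* Simulate a crash that flushed only part of the record. */
        printf("truncated record valid: %d\n",
               record_is_valid(buf, sizeof(buf) - 4));
        return 0;
    }

The point is simply that the validity of the data is the application's to
verify; the file system only promises that its own structures are
consistent.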

>> _Practical File System Design_, Dominic Giampaolo, p. 36
>
>Thanks. Another book never to buy
>
Actually, it is my hope that people would read the book before making
such strong comments in a public forum (especially when those comments
are wrong).

--dominic
[I don't read this list so if you have comments, please reply to me directly]
