Re: 2.6.19 file content corruption on ext3

From: Linus Torvalds
Date: Mon Dec 18 2006 - 13:06:00 EST



Andrei,
could you try Peter's patch (on top of Andrew's patch - it depends on
it, and wouldn't work on an unmodified -git kernel, but add the WARN_ON()
I mention in this email? You seem to be able to reproduce this easily..
Thanks)

On Mon, 18 Dec 2006, Peter Zijlstra wrote:
>
> This should be safe; page_mkclean walks the rmap and flips the pte's
> under the pte lock and records the dirty state while iterating.
> Concurrent faults will either do set_page_dirty() before we get around
> to doing it or vice versa, but dirty state is not lost.

Ok, I really liked this patch, but the more I thought about it, the more I
started to doubt the reasons for liking it.

I think we have some core fundamental problem here that this patch is
needed at all.

So let's think about this: we apparently have two cases of
"clear_page_dirty()":

- the one that really wants to clear the bit unconditionally (Andrew
calls this the "must_clean_ptes" case, which I personally find to be a
really confusing name, but whatever)

- the other case. The case that doesn't want to really clear the pte
dirty bits.

and I thought your patch made sense, because it saved away the pte state
in the page dirty state, and that matches my mental model, but the more I
think about it, the less sense that whole "the other case" situation makes
AT ALL.

Why does "the other case" exist at all? If you want to clear the dirty
page flag, what is _ever_ the reason for not wanting to drop PTE dirty
information? In other words, what possible reason can there ever be for
saying "I want this page to be clean", while at the same time saying "but
if it was dirty in the page tables, don't forget about that state".

So I absolutely detested Andrew's original patch, because that one made
zero sense at all even from a code standpoint. With your patch on top, it
all suddenly makes sense: at least you don't just leave dirty pages in the
PTE's with a "struct page" that is marked clean, and the end result is
undeniably at least _consistent_.

So Andrew's patch I can't stand, because the whole point of it seems to be
to leave the system in an inconsistent state (dirty in the pte's but
marked "clean"), and if we want to have that state, then we should just
revert _everything_ to the 2.6.18 situation, and not play these games at
all.

Andrew's patch with your patch on top makes me happy, because now we're
at least honoring all the basic rules (we don't get into an inconsistent
state), so on a local level it all makes sense. HOWEVER, I then don't
actually understand how it could ever actually make sense to ask for
"please clean the page, but don't actually clean it".

So _I_ think that we should add a honking huge WARN_ON() for this case.
Ie, do your patch, but instead of re-dirtying the page:

+ if (!must_clean_ptes && cleaned)
+ set_page_dirty(page);

we would do

+ if (!must_clean_ptes && cleaned) {
+ WARN_ON(1);
+ set_page_dirty(page);
+ }

and ask the people who see this problem to see if they get the WARN_ON()
(assuming it _fixes_ their data corruption).

Because whoever calls "clean_dirty_page()" without actually wanting to
clean the PTE's is really a bug: those dirty PTE's had better not exist.

Or maybe the WARN_ON() just points out _why_ somebody would want to do
something this insane. Right now I just can't see why it's a valid thing
to do.

Maybe I'm still confused.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/