Re: [patch 2/4] vfs: add set_page_dirty_notag

From: Nick Piggin
Date: Tue Feb 17 2009 - 05:25:05 EST


On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote:
> On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote:
>
> > It is a great shame that filesystems are not properly notified
> > that a page may become dirty before the actual set_page_dirty
> > event (which is not allowed to fail and is called after the
> > page is already dirty).
>
> Not quite true, for example the set_page_dirty() done by the write fault
> code is done _before_ the page becomes dirty.
>
> This before/after thing was the reason for that horrid file corruption
> bug that dragged on for a few weeks back in .19 (IIRC).

Yeah, there are actually races though. The page can become cleaned
before set_page_dirty is reached, and there are also nasty races with
truncate.


> > This is a big problem I have with fsblock simply in trying to
> > make the memory allocation robust. page_mkwrite unfortunately
> > is racy and I've fixed problems there... the big problem though
> > is get_user_pages. Fixing that properly seems to require fixing
> > callers so it is not really realistic in the short term.
>
> Right, I'm just not sure what we can do, even with a
> prepage_page_dirty() function, what are you going to do, fail the fault?

Oh, for regular page fault functions using page_mkwrite, they
definitely want to fail the fault with a SIGBUS, and actually XFS
already does that (for fsblock robust memory allocations you
would also want to fail OOM on metadata allocation failure). What
is the other option? Silently fail the write?

For XFS purpose (ie. -ENOSPC handling), the current code is reasonable
although there could be some truncate races with block allocation. But
mostly probably works. For something like fsblock it can be much more
common to have the metadata refcount reach 0 and freed before spd is
called. In that case the code actually goes into a bug situation so it
is a bit more critical.

But no that's the "easy" part. The hard part is get_user_pages
because the caller can hold onto the page indefinitely simply with a
refcount, and go along happily dirtying it at any stage (actually
writing to the page memory) before actually calling set_page_dirty.

The "cleanest" way to fix this from VM point of view is probably to
force gup callers to hold the page locked for the duration to
prevent truncation or writeout after the filesystem notification.
Don't know if that would be very popular, however.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/