Re: [patch 9/9] mm: fix pagecache write deadlocks

From: Andrew Morton
Date: Tue Feb 06 2007 - 00:30:36 EST

Next message: Andrew Morton: "Re: 2.6.20 PCI Cannot allocate resource region 2"
Previous message: Vivek Goyal: "Re: [PATCH] Compressed ia32 ELF file generation for loading by Gujin 1/3"
In reply to: Nick Piggin: "Re: [patch 9/9] mm: fix pagecache write deadlocks"
Next in thread: Nick Piggin: "Re: [patch 9/9] mm: fix pagecache write deadlocks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, 6 Feb 2007 05:41:46 +0100 Nick Piggin <npiggin@xxxxxxx> wrote:

> On Tue, Feb 06, 2007 at 03:25:49AM +0100, Nick Piggin wrote:
> > On Sun, Feb 04, 2007 at 10:36:20AM -0800, Andrew Morton wrote:
> > > On Sun, 4 Feb 2007 16:10:51 +0100 Nick Piggin <npiggin@xxxxxxx> wrote:
> > >
> > > > They're not likely to hit the deadlocks, either. Probability gets more
> > > > likely after my patch to lock the page in the fault path. But practially,
> > > > we could live without that too, because the data corruption it fixes is
> > > > very rare as well. Which is exactly what we've been doing quite happily
> > > > for most of 2.6, including all distro kernels (I think).
> > >
> > > Thing is, an application which is relying on the contents of that page is
> > > already unreliable (or really peculiar), because it can get indeterminate
> > > results anyway.
> >
> > Not necessarily -- they could read from one part of a page and write to
> > another. I see this as the biggest data corruption problem.

The kernel gets that sort of thing wrong anyway, and always has, because
it uses memcpy()-style copying and not memmove()-style.

I can't imagine what sort of application you're envisaging here. The
problem was only ever observed from userspace by an artificial stress-test
thing.

> And in fact, it is not just transient errors either. This problem can
> add permanent corruption into the pagecache and onto disk, and it doesn't
> even require two processes to race.
>
> After zeroing out the uncopied part of the page, and attempting to loop
> again, we might bail out of the loop for any reason before completing the
> rest of the copy, leaving the pagecache corrupted, which will soon go out
> to disk.
>

Only because ->commit_write() went and incorrectly marked parts of the page
as up-to-date.

Zeroing out the fag end of the copy_from_user() on fault is actually incorrect.
What we _should_ do is to bring those uncopyable, non-uptodate parts of the
page uptodate rather than zeroing them. ->readpage() does that.

So... what happens if we do

lock_page()
prepare_write()
if (copy_from_user_atomic()) {
readpage()
wait_on_page()
lock_page()
}
commit_write()
unlock_page()

- If the page has no buffers then it is either fully uptodate or fully
not uptodate. In the former case, don't call readpage at all. In the
latter case, readpage() is the correct thing to do.

- If the page had buffers, then readpage() won't touch the uptodate ones
and will bring the non-uptodate ones up to date from disk.

Some of the data which we copied from userspace may get overwritten
from backing store, but that's OK.

seems crazy, but it's close. We do have the minor problem that readpage
went and unlocked the page so we need to relock it. I bet there are holes
in there.

Idea #42: after we've locked the pagecache page, do an atomic get_user()
against the source page(s) before attempting the copy_from_user(). If that
faults, don't run prepare_write or anything else: drop the page lock and
try again.

Because

- If the get_user() faults, it might be because the page we're copying
from and to is the same page, and someone went and unmapped it: deadlock.

- If the get_user() doesn't fault, and if we're copying from and to the
same page, we know that we've locked it, so nobody will be able to unmap
it while we're copying from it.

Close, but no cigar! This is still vulnerable to Hugh's ab/ba deadlock
scenario.

btw, to fix the writev() performance problem we may need to go off and run
get_user() against up to 1024 separate user pages before locking the
pagecache page, which sounds fairly idiotic. Are you doing that in the
implemetnations which you've been working on? I forget...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Andrew Morton: "Re: 2.6.20 PCI Cannot allocate resource region 2"
Previous message: Vivek Goyal: "Re: [PATCH] Compressed ia32 ELF file generation for loading by Gujin 1/3"
In reply to: Nick Piggin: "Re: [patch 9/9] mm: fix pagecache write deadlocks"
Next in thread: Nick Piggin: "Re: [patch 9/9] mm: fix pagecache write deadlocks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]