Re: missing madvise functionality

From: Andrew Morton
Date: Tue Apr 03 2007 - 15:59:47 EST


On Tue, 3 Apr 2007 19:28:41 +0200
Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:

> On Tue, Apr 03, 2007 at 10:20:02AM -0700, Ulrich Drepper wrote:
> > Andi Kleen wrote:
> > > Why do you need a lock for that? I don't see any problem with
> > > two threads doing that in parallel. The kernel would
> > > serialize it internally and one would fail, but that shouldn't
> > > be a problem.
> >
> > There is no lock at all at userlevel. I'm talking about locks in the
> > kernel.
>
> mmap_sem? Your new operation wouldn't solve that neither.

It might, a bit. Both mmap() and mprotect() currently take mmap_sem() for
writing. If we're careful, we could probably arrange for MADV_ULRICH to
take it for reading, which will help a little bit, hopefully.

It's a little sad that mprotect() takes mmap_sem for writing, really. I think
the only reason for doing that is because we might do a vma_merge() as a
result. Perhaps this is on the wrong side of the speed/space tradeoff.

otoh, converting a down_write() to a down_read() may well not have much
effect.

Ulrich, could you suggest a little test app which would demonstrate this
behaviour?

> There were some proposals to fix mmap_sem (it's a big issue
> for futexes too) but they're are quite involved.

yup.

Question:

> - if an access to a page in the range happens in the future it must
> succeed. The old page content can be provided or a new, empty page
> can be provided

How important is this "use the old page if it is available" feature? If we
were to simply implement a fast unconditional-free-the-page, so that
subsequent accesses always returned a new, zeroed page, do we expect that
this will be a 90%-good-enough thing, or will it be significantly
inefficient?


If we do implement this retain-the-old-page-if-possible feature, I'm
thinking that we can possibly reuse swapcache concepts. Such a page is
very similar to a clean, unmapped swapcache page, only it doesn't actually
have a swap mapping (well, it might have a swap mapping, in which case we
don't need to do anything at all, except deactivate it).

So perhaps we can do something like chop swapper_space in half: the lower
50% represent offsets which have a swap mapping and the upper 50% are fake
swapcache pages which don't actually consume swapspace. These pages are
unmapped from pagetables, marked clean, added to the fake part of
swapper_space and are deactivated. Teach the low-level swap code to ignore
the request to free physical swapspace when these pages are released.

Or, if that's all too hacky, create a new address_space for these pages and
burn a new page flag. But I suspect we'd end up duplicating so much
swapcache handling that this will end up looking silly.

This would all halve the maximum amount of swap which can be used. iirc
i386 supports 27 bits of swapcache indexing, and 26 bits is 274GB, which
is hopefully enough..

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/