Re: [RFC v2] Support volatile range for anon vma

From: Paul Turner
Date: Wed Oct 31 2012 - 17:59:28 EST


On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
<akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, 30 Oct 2012 10:29:54 +0900
> Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> > This patch introudces new madvise behavior MADV_VOLATILE and
> > MADV_NOVOLATILE for anonymous pages. It's different with
> > John Stultz's version which considers only tmpfs while this patch
> > considers only anonymous pages so this cannot cover John's one.
> > If below idea is proved as reasonable, I hope we can unify both
> > concepts by madvise/fadvise.
> >
> > Rationale is following as.
> > Many allocators call munmap(2) when user call free(3) if ptr is
> > in mmaped area. But munmap isn't cheap because it have to clean up
> > all pte entries and unlinking a vma so overhead would be increased
> > linearly by mmaped area's size.
>
> Presumably the userspace allocator will internally manage memory in
> large chunks, so the munmap() call frequency will be much lower than
> the free() call frequency. So the performance gains from this change
> might be very small.

I don't think I strictly understand the motivation from a
malloc-standpoint here.

These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
to perform discards on Linux. For any reasonable allocator (short
of binding malloc --> mmap, free --> unmap) this seems a better
choice.

Note also from a performance stand-point I doubt any allocator (which
case about performance) is going to want to pay the cost of even a
null syscall about typical malloc/free usage (consider: a tcmalloc
malloc/free pairis currently <20ns). Given then that this cost is
amortized once you start doing discards on larger blocks MADV_DONTNEED
seems a preferable interface:
- You don't need to reconstruct an arena when you do want to allocate
since there's no munmap/mmap for the region to change about
- There are no syscalls involved in later reallocating the block.

The only real additional cost is address-space. Are you strongly
concerned about the 32-bit case?

>
> The whole point of the patch is to improve performance, but we have no
> evidence that it was successful in doing that! I do think we'll need
> good quantitative testing results before proceeding with such a patch,
> please.
>
> Also, it is very desirable that we involve the relevant userspace
> (glibc, etc) developers in this. And I understand that the google
> tcmalloc project will probably have interest in this - I've cc'ed
> various people@google in the hope that they can provide input (please).
>
> Also, it is a userspace API change. Please cc mtk.manpages@xxxxxxxxxx
>
> Also, I assume that you have userspace test code. At some stage,
> please consider adding a case to tools/testing/selftests. Such a test
> would require to creation of memory pressure, which is rather contrary
> to the selftests' current philosopy of being a bunch of short-running
> little tests. Perhaps you can come up with something. But I suggest
> that such work be done later, once it becomes clearer that this code is
> actually headed into the kernel.
>
> > Allocator should call madvise(MADV_NOVOLATILE) before reusing for
> > allocating that area to user. Otherwise, accessing of volatile range
> > will meet SIGBUS error.
>
> Well, why? It would be easy enough for the fault handler to give
> userspace a new, zeroed page at that address.

Note: MADV_DONTNEED already has this (nice) property.

>
> Or we could simply leave the old page in place at that address. If the
> page gets touched, we clear MADV_NOVOLATILE on its VMA and give the
> page (or all the not-yet-reclaimed pages) back to userspace at their
> old addresses.
>
> Various options suggest themselves here. You've chosen one of them but
> I would like to see a pretty exhaustive description of the reasoning
> behind that decision.
>
> Also, I wonder about the interaction with other vma manipulation
> operations. For example, can a VMA get split when in the MADV_VOLATILE
> state? If so, what happens?
>
> Also, I see no reason why the code shouldn't work OK with nonlinear VMAs,
> but I bet this wasn't tested ;)
>
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -86,6 +86,22 @@ static long madvise_behavior(struct vm_area_struct * vma,
> > if (error)
> > goto out;
> > break;
> > + case MADV_VOLATILE:
> > + if (vma->vm_flags & VM_LOCKED) {
> > + error = -EINVAL;
> > + goto out;
> > + }
> > + new_flags |= VM_VOLATILE;
> > + vma->purged = false;
> > + break;
> > + case MADV_NOVOLATILE:
> > + if (!(vma->vm_flags & VM_VOLATILE)) {
> > + error = -EINVAL;
> > + goto out;
>
> I wonder if this really should return an error. Other madvise()
> options don't do this, and running MADV_NOVOLATILE against a
> not-volatile area seems pretty benign and has clearly defined before-
> and after- states.
>
> > + }
> > +
> > + new_flags &= ~VM_VOLATILE;
> > + break;
> > }
> >
> > if (new_flags == vma->vm_flags) {
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/