Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)
From: Minchan Kim
Date: Wed Nov 04 2015 - 20:48:53 EST
On Wed, Nov 04, 2015 at 05:29:57PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 4, 2015 at 4:56 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > On Wed, Nov 04, 2015 at 04:42:37PM -0800, Andy Lutomirski wrote:
> >> On Wed, Nov 4, 2015 at 4:13 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >> > On Tue, Nov 03, 2015 at 07:41:35PM -0800, Andy Lutomirski wrote:
> >> >> On Nov 3, 2015 5:30 PM, "Minchan Kim" <minchan@xxxxxxxxxx> wrote:
> >> >> >
> >> >> > Linux has no way to free pages lazily, while other OSes already
> >> >> > support this through madvise(MADV_FREE).
> >> >> >
> >> >> > The gain is clear: the kernel can discard freed pages rather than swapping
> >> >> > them out or triggering OOM when memory pressure happens.
> >> >> >
> >> >> > Without memory pressure, freed pages can be reused by userspace without
> >> >> > any additional overhead (e.g., page fault + allocation + zeroing).
> >> >> >
> >> >>
> >> >> [...]
> >> >>
> >> >> >
> >> >> > How it works:
> >> >> >
> >> >> > When the madvise syscall is called, the VM clears the dirty bit of the ptes
> >> >> > in the range. If memory pressure happens, the VM checks the dirty bit in the
> >> >> > page table, and if it is still "clean", the page is a "lazyfree" page, so
> >> >> > the VM can discard it instead of swapping it out. If there was a store to
> >> >> > the page before the VM picks it for reclaim, the dirty bit is set, so the VM
> >> >> > swaps the page out instead of discarding it.
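
For reference, a minimal userspace sketch of how the hint described above might be
used. The helper names lazy_free_range() and lazy_free_demo() are made up for
illustration, and the fallback to MADV_DONTNEED on EINVAL is my own assumption for
kernels without MADV_FREE; it is not part of the patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: hint that [addr, addr + len) may be freed lazily.
 * Falls back to MADV_DONTNEED when MADV_FREE is unavailable (assumption). */
static int lazy_free_range(void *addr, size_t len)
{
#ifdef MADV_FREE
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
#endif
	return madvise(addr, len, MADV_DONTNEED);
}

/* Hypothetical demo: map private anonymous memory, dirty it, hint it,
 * then store to it again so the first page counts as in use once more. */
int lazy_free_demo(void)
{
	size_t len = 1 << 16;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int rc, ok;

	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xaa, len);		/* touch every page */
	rc = lazy_free_range(buf, len);	/* old contents may now be dropped */
	buf[0] = 1;			/* a new store makes the page valid again */
	ok = (rc == 0 && buf[0] == 1);

	munmap(buf, len);
	return ok ? 0 : -1;
}
```

Only private anonymous memory is used here, since other mappings are filtered out
by the patch. After the hint, the old contents must not be relied upon until the
page has been written again.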
> >> >>
> >> >> What happens if you MADV_FREE something that's MAP_SHARED or isn't
> >> >> ordinary anonymous memory? There's a long history of MADV_DONTNEED on
> >> >> such mappings causing exploitable problems, and I think it would be
> >> >> nice if MADV_FREE were obviously safe.
> >> >
> >> > It filters out VM_LOCKED|VM_HUGETLB|VM_PFNMAP vmas, and file-backed and
> >> > MAP_SHARED vmas via vma_is_anonymous.
> >> >
> >> >>
> >> >> Does this set the write protect bit?
> >> >
> >> > No.
> >> >
> >> >>
> >> >> What happens on architectures without hardware dirty tracking? For
> >> >> that matter, even on architecture with hardware dirty tracking, what
> >> >> happens in multithreaded processes that have the dirty TLB state
> >> >> cached in a different CPU's TLB?
> >> >>
> >> >> Using the dirty bit for these semantics scares me. This API creates a
> >> >> page that can have visible nonzero contents and then can
> >> >> asynchronously and magically zero itself thereafter. That makes me
> >> >> nervous. Could we use the accessed bit instead? Then the observable
> >> >
> >> > The accessed bit is used by the aging algorithm for reclaim. In addition,
> >> > we support the clear_refs feature.
> >> > IOW, it can be reset at any time, so it's hard to use it as a marker for
> >> > lazy freeing at the moment.
> >> >
> >>
> >> That's unfortunate. I think that the ABI would be much nicer if it
> >> used the accessed bit.
> >>
> >> In any case, shouldn't the aging algorithm be irrelevant here? A
> >> MADV_FREE page that isn't accessed can be discarded, whereas we could
> >> hopefully just say that a MADV_FREE page that is accessed gets moved
> >> to whatever list holds recently accessed pages and also stops being a
> >> candidate for discarding due to MADV_FREE?
> >
> > I meant that if we use the accessed bit as the indicator for a lazy-freeing
> > page, we could discard a valid page that was never hinted with MADV_FREE but
> > simply has its accessed bit cleared in the page table by the aging algorithm.
>
> Oh, is the rule that the anonymous pages that are clean are discarded
> instead of swapped out? That is, does your patch set detect that an
A page that was swapped out and then swapped in has a clean pte, and the swap
device still holds valid data as long as the page isn't touched, so the VM
discards the page rather than swapping it out again. Of course, the pte should
then point to the swap slot. If the VM decides to remove the page from its
swap slot, the page must be marked PG_dirty.
> anonymous page can be discarded if it's clean and that the lack of a
> dirty bit is the only indication that the page has been hit with
> MADV_FREE?
The absence of a dirty bit, yes; strictly speaking, of PG_dirty,
because the page I mentioned above has a clean pte but will have PG_dirty set.
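
To make the distinction concrete, here is a toy model of the rule as I described
it in this mail. It is only a sketch, not code from the patch: struct anon_page,
its fields, and reclaim_decision() are names invented for illustration, and
real reclaim checks much more than these three flags.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of an anonymous page at reclaim time. */
struct anon_page {
	bool pte_dirty;		/* hardware dirty bit in the pte */
	bool pg_dirty;		/* PG_dirty flag on the struct page */
	bool has_swap_copy;	/* a swap slot still holds valid data */
};

enum reclaim_action {
	RECLAIM_DISCARD,	/* lazyfree page: drop the contents */
	RECLAIM_DROP_TO_SWAP,	/* pte points to the existing swap slot, no I/O */
	RECLAIM_SWAP_OUT,	/* contents must be written to swap first */
};

static enum reclaim_action reclaim_decision(struct anon_page p)
{
	/* A store after the hint, or PG_dirty set when the swap slot
	 * was removed, means the data in memory is the only copy. */
	if (p.pte_dirty || p.pg_dirty)
		return RECLAIM_SWAP_OUT;

	/* Swapped in and untouched: the swap device has the same data. */
	if (p.has_swap_copy)
		return RECLAIM_DROP_TO_SWAP;

	/* Clean pte, no PG_dirty, no swap copy: hinted with MADV_FREE. */
	return RECLAIM_DISCARD;
}
```

So a page that was swapped in and not written is never mistaken for a lazyfree
page in this model: either its swap slot is still valid, or PG_dirty was set
when the slot was removed.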
>
> If so, that seems potentially error prone -- I had assumed that pages
> that were swapped in but not written since swap-in would also be
> clean, and I don't see how you distinguish them.
I hope the above answers this.
>
> --Andy