Re: [PATCH v9] mm: support madvise(MADV_FREE)

From: Minchan Kim
Date: Thu Jul 03 2014 - 04:36:14 EST


Hello,

On Thu, Jul 03, 2014 at 10:29:01AM +0200, Martin Schwidefsky wrote:
> On Thu, 3 Jul 2014 16:29:54 +0900
> Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> > Hello,
> >
> > On Thu, Jul 03, 2014 at 10:03:19AM +0900, Minchan Kim wrote:
> > > Hello,
> > >
> > > On Tue, Jul 01, 2014 at 05:50:58PM +0300, Kirill A. Shutemov wrote:
> > > > On Tue, Jul 01, 2014 at 09:36:15AM +0900, Minchan Kim wrote:
> > > > > +	do {
> > > > > +		/*
> > > > > +		 * XXX: We can optimize with supporting Hugepage free
> > > > > +		 * if the range covers.
> > > > > +		 */
> > > > > +		next = pmd_addr_end(addr, end);
> > > > > +		if (pmd_trans_huge(*pmd))
> > > > > +			split_huge_page_pmd(vma, addr, pmd);
> > > >
> > > > Could you implement proper THP support before upstreaming the feature?
> > > > It shouldn't be a big deal.
> > >
> > > Okay, Hope to review.
> > >
> > > Thanks for the feedback!
> > >
> >
> > I tried to implement it but ran into an issue.
> >
> > I need pmd_mkold and pmd_mkclean for the MADV_FREE operation, and
> > pmd_dirty for page_referenced. I looked at all the arches that support
> > THP and it's not a big deal for most of them, but I'm not sure about
> > s390, since I have no idea how s390 does soft tracking by storage key
> > instead of page table information.
> > Cc'ed the s390 maintainer. Hoping for help.
>
> Storage key for dirty and referenced tracking is a thing of the past.
> The current code for s390 uses software tracking for dirty and referenced.
> There is one catch though, for ptes the software implementation covers
> dirty and referenced bit but for pmds only referenced bit is available.
> The reason is that there is no free bit left in the pmd entry for the
> software dirty bit.

Thanks for the quick reply.

>
> > So, if there isn't any help from s390, I would have to introduce
> > HAVE_ARCH_THP_MADVFREE to disable MADV_FREE support for THP on s390,
> > but I don't want to introduce such a new config option.
>
> Why is the dirty bit for pmds needed for the MADV_FREE implementation?

The MADV_FREE semantics require it.

When the madvise syscall is called, the VM clears the dirty bit of the
ptes in the range. If memory pressure happens, the VM checks the dirty
bit in the page table, and if it is still clean, the page is a
"lazyfree" page, so the VM can discard it instead of swapping it out.
If there was a store to the page before the VM picks it for reclaim,
the dirty bit is set again, so the VM swaps the page out instead of
discarding it, to keep the up-to-date contents.
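
To make the rule above concrete, here is a rough toy model in plain C.
It is only a sketch, not code from the patch: struct toy_pte,
toy_madv_free(), toy_store() and toy_reclaim() are names made up for
illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page table entry; this is not the kernel's pte_t. */
struct toy_pte {
	bool dirty;
};

enum reclaim_action { RECLAIM_DISCARD, RECLAIM_SWAP_OUT };

/* Models madvise(MADV_FREE) on the range: clear the dirty bit only. */
static void toy_madv_free(struct toy_pte *pte)
{
	assert(pte != NULL);
	pte->dirty = false;
}

/* Models a store to the page: the dirty bit gets set again. */
static void toy_store(struct toy_pte *pte)
{
	assert(pte != NULL);
	pte->dirty = true;
}

/*
 * Models the reclaim decision under memory pressure:
 * still clean means lazyfree, so discard; dirty means swap out.
 */
static enum reclaim_action toy_reclaim(const struct toy_pte *pte)
{
	assert(pte != NULL);
	return pte->dirty ? RECLAIM_SWAP_OUT : RECLAIM_DISCARD;
}
```

In this model, a page that is hinted with toy_madv_free() and then left
alone is discarded by toy_reclaim(), while a toy_store() in between
turns it back into a page that has to be swapped out.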

If it's hard on s390, maybe we could just use the referenced bit
instead of the dirty bit to check for recent access, but that might
make the semantics differ a bit from other OSes. :(
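
One way the semantics could differ, as a toy sketch in plain C. This is
my assumption about the difference, not something measured or taken
from any patch, and struct toy_entry and the toy_entry_*() / toy_keep_*()
helpers are made-up names: with the referenced bit, a plain load after
the hint would already be enough to keep the page, while with the dirty
bit only a store does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy entry with both software bits; not a real pte or pmd layout. */
struct toy_entry {
	bool dirty;
	bool referenced;
};

/* Models the hint: make the entry old and clean. */
static void toy_entry_madv_free(struct toy_entry *e)
{
	assert(e != NULL);
	e->dirty = false;
	e->referenced = false;
}

/* Models a load: only the referenced bit is set. */
static void toy_entry_load(struct toy_entry *e)
{
	assert(e != NULL);
	e->referenced = true;
}

/* Models a store: both referenced and dirty are set. */
static void toy_entry_store(struct toy_entry *e)
{
	assert(e != NULL);
	e->referenced = true;
	e->dirty = true;
}

/* Dirty-based policy: keep the contents only if the page was written. */
static bool toy_keep_by_dirty(const struct toy_entry *e)
{
	assert(e != NULL);
	return e->dirty;
}

/* Referenced-based policy: keep the contents if the page was touched at all. */
static bool toy_keep_by_referenced(const struct toy_entry *e)
{
	assert(e != NULL);
	return e->referenced;
}
```

After toy_entry_madv_free() followed by toy_entry_load(), the
dirty-based check says the page can be discarded but the
referenced-based check says it must be kept. After a toy_entry_store()
both checks agree.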

>
> --
> blue skies,
> Martin.
>
> "Reality continues to ruin my life." - Calvin.
>

--
Kind regards,
Minchan Kim