Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

From: Barry Song
Date: Tue Feb 27 2024 - 02:54:57 EST


On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
>
>
> On 2/27/24 15:21, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >>
> >> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2/27/24 14:40, Barry Song wrote:
> >>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/27/24 10:17, Barry Song wrote:
> >>>>>>> Like if we hit a folio which is partially mapped to the range, don't split it but
> >>>>>>> just unmap the mapped part from the range. Let page reclaim decide whether to
> >>>>>>> split the large folio or not (if it's not mapped to any other range, it will be
> >>>>>>> freed as a whole large folio. If part of it is still mapped to another range, page reclaim
> >>>>>>> can decide whether to split it or ignore it for the current reclaim cycle).
> >>>>>> Yes, we can. But we still have to play the PTE check game to avoid adding
> >>>>>> folios to the reclaim list multiple times.
> >>>>>>
> >>>>>> I don't see much difference between splitting in madvise and splitting
> >>>>>> in vmscan, as our real purpose is to avoid splitting entirely mapped
> >>>>>> large folios. For partially mapped large folios, if we split in madvise, then
> >>>>>> we don't need to play the game of skipping folios while iterating PTEs.
> >>>>>> If we don't split in madvise, we have to make sure the large folio is only
> >>>>>> added to the reclaim list once, by checking whether PTEs belong to the
> >>>>>> previously added folio.
> >>>>>
> >>>>> If the partially mapped large folio is unmapped from the range, the related PTEs
> >>>>> become none. How could the folio be added to the reclaim list multiple times?
> >>>>
> >>>> Say we have 16 PTEs in a large folio:
> >>>> PTE0 present
> >>>> PTE1 present
> >>>> PTE2 present
> >>>> PTE3 none
> >>>> PTE4 present
> >>>> PTE5 none
> >>>> PTE6 present
> >>>> ....
> >>>> The current code scans PTEs one by one.
> >>>> While scanning PTE0, we have added the folio. Then PTE1, PTE2, PTE4, PTE6...
> >>> No. Before detecting that the folio is fully mapped to the range, we can't add the
> >>> folio to the reclaim list, because a partially mapped folio shouldn't be added. We
> >>> only know it's fully mapped once we have scanned PTE15.
> >>
> >> You never know that PTE15 is the last one mapping the large folio; PTE15 can
> >> be mapping a completely different folio from PTE0.
> >>
> >>>
> >>> So, when scanning PTE0, we will not add the folio. Then when we hit PTE3, we know
> >>> this is a partially mapped large folio. We will unmap it. Then all 16 PTEs
> >>> become none.
> >>
> >> I don't understand why all 16 PTEs become none, as we don't set PTEs to none.
> >> PTEs are only set to swap entries once try_to_unmap_one is called by vmscan.
> >>
> >>>
> >>> If the large folio is fully mapped, the folio will be added to the reclaim list
> >>> after we scan PTE15 and know it's fully mapped.
> >>
> >> Our approach is calling pte_batch_pte when meeting the first PTE; if
> >> pte_batch_pte = 16, then we add this folio to reclaim_list and skip the
> >> remaining 15 PTEs.
> >
> > Let's compare two different implementations for a partially mapped large
> > folio with 8 PTEs as below:
> >
> > PTE0 present for large folio1
> > PTE1 present for large folio1
> > PTE2 present for another folio2
> > PTE3 present for another folio3
> > PTE4 present for large folio1
> > PTE5 present for large folio1
> > PTE6 present for another folio4
> > PTE7 present for another folio5
> >
> > If we don't split in madvise (and depend on vmscan to split after adding
> > folio1), we will have
> Let me clarify something here:
>
> I prefer that we don't split the large folio here. Instead, we unmap the
> large folio from this VMA range (I think you missed the unmap operation
> I mentioned).

I don't understand why we would unmap, as this is MADV_PAGEOUT, not
an unmap. Unmapping totally changes the semantics. Would you like
to show pseudocode?

For MADV_PAGEOUT on swap-out, the last step is writing swap entries
to replace the PTEs which are present. I don't understand how an unmap
can be involved in this process.
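To make the semantic difference concrete, here is a toy userspace model.
It is only a sketch: toy_pte, toy_pageout and toy_unmap are made-up names,
not kernel APIs, and the real swap-out path is far more involved. The point
is just that pageout leaves a swap entry behind, while unmap leaves nothing.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PTE model for illustration only; this is not the kernel's pte_t. */
enum toy_pte_kind { TOY_PTE_NONE, TOY_PTE_PRESENT, TOY_PTE_SWAP };

struct toy_pte {
	enum toy_pte_kind kind;
	unsigned long val;	/* pfn when present, swap slot when swapped */
};

/* Swap-out path: a present PTE ends up as a swap entry. */
void toy_pageout(struct toy_pte *pte, unsigned long swap_slot)
{
	assert(pte);
	if (pte->kind == TOY_PTE_PRESENT) {
		pte->kind = TOY_PTE_SWAP;
		pte->val = swap_slot;
	}
}

/* Unmap path: the PTE is cleared, so nothing points at the data. */
void toy_unmap(struct toy_pte *pte)
{
	assert(pte);
	pte->kind = TOY_PTE_NONE;
	pte->val = 0;
}

/* The data is still reachable through the PTE unless the PTE is none. */
bool toy_data_reachable(const struct toy_pte *pte)
{
	assert(pte);
	return pte->kind != TOY_PTE_NONE;
}
```

In this model, a later access to a paged-out PTE could still fault the data
back in from its swap slot, while an unmapped PTE has lost it.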

>
> The intention is to try our best to avoid splitting the large folio. If
> the folio is only partially mapped to this VMA range, it's likely it
> will be reclaimed as a whole large folio, which reduces LRU and zone
> lock contention compared to splitting the large folio.

That also brings negative side effects such as redundant I/O.
For example, if you have only one subpage left in a large folio,
pageout will still write nr_pages subpages into swap, then immediately
free them in swap.
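As rough arithmetic for the redundant I/O, a minimal sketch (the function
name is made up for illustration, and it assumes the whole folio is written
out regardless of how many subpages are still in use):

```c
#include <assert.h>

/*
 * Number of subpage writes that are wasted when a large folio of
 * nr_pages is paged out whole while only nr_in_use subpages are
 * still in use.
 */
unsigned int toy_redundant_swap_writes(unsigned int nr_pages,
				       unsigned int nr_in_use)
{
	assert(nr_in_use <= nr_pages);
	return nr_pages - nr_in_use;
}
```

With a 16-page folio and one subpage left, toy_redundant_swap_writes(16, 1)
gives 15 wasted writes; a fully used folio gives 0.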

>
> What I am not sure about is that unmapping from a specific VMA range is
> not available, and whether it's worth adding.

I think it might be possible to write some complex code that adds
folio1, folio2, folio3, folio4 and folio5 in the above example to
reclaim_list while avoiding splitting folio1. But I really don't understand
how unmap would work.
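A userspace sketch of that kind of scan, under loud assumptions: folios are
plain integer ids, 0 stands for a none PTE, and the queued[] array stands in
for whatever per-folio state real code would have to consult. None of these
names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_FOLIOS 16

/*
 * pte_folio[i] is the id of the folio mapped by PTE i (0 means none).
 * Writes each distinct folio id to out[] once, in scan order, and
 * returns how many folios were queued.
 */
size_t toy_queue_folios_once(const int *pte_folio, size_t nr_ptes, int *out)
{
	bool queued[TOY_MAX_FOLIOS] = { false };
	size_t n = 0;

	for (size_t i = 0; i < nr_ptes; i++) {
		int id = pte_folio[i];

		if (id == 0)
			continue;	/* none PTE: nothing mapped here */
		assert(id > 0 && id < TOY_MAX_FOLIOS);
		if (queued[id])
			continue;	/* already on the reclaim list */
		queued[id] = true;
		out[n++] = id;
	}
	return n;
}
```

For the 8-PTE layout above ({1, 1, 2, 3, 1, 1, 4, 5}), folio1 is queued at
PTE0 and then skipped at PTE1, PTE4 and PTE5, so five folios are queued.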

>
> > to make sure folio1, folio2, folio3, folio4, folio5 are added to
> > reclaim_list by playing a complex game while scanning these 8 PTEs.
> >
> > If we split in madvise, they become:
> >
> > PTE0 present for folioA - split from folio1
> > PTE1 present for folioB - split from folio1
> > PTE2 present for another folio2
> > PTE3 present for another folio3
> > PTE4 present for folioC - split from folio1
> > PTE5 present for folioD - split from folio1
> > PTE6 present for another folio4
> > PTE7 present for another folio5
> >
> > We simply add the above 8 folios to reclaim_list one by one.
> >
> > I would vote for splitting partially mapped large folios in madvise.
> >
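The batch-then-decide idea can be sketched in userspace as below. This is
an illustration only: toy_pte_map, toy_pte_batch and toy_needs_split are
hypothetical names, folios are integer ids, and the subpage indices in the
usage note are my assumption about the layout, not something stated above.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pte_map {
	int folio;		/* folio id, 0 for a none PTE */
	unsigned int subpage;	/* index of the subpage inside that folio */
};

/*
 * Count how many PTEs, starting at ptes[start], map consecutive
 * subpages of the same folio. Returns 0 for a none PTE.
 */
size_t toy_pte_batch(const struct toy_pte_map *ptes, size_t nr_ptes,
		     size_t start)
{
	size_t n = 1;

	assert(start < nr_ptes);
	if (ptes[start].folio == 0)
		return 0;
	while (start + n < nr_ptes &&
	       ptes[start + n].folio == ptes[start].folio &&
	       ptes[start + n].subpage == ptes[start].subpage + n)
		n++;
	return n;
}

/* Nonzero if the batch does not cover the whole folio, so we would split. */
int toy_needs_split(size_t batch, unsigned int folio_nr_pages)
{
	return batch != folio_nr_pages;
}
```

If folio1 had 8 subpages with PTE0/PTE1 mapping subpages 0 and 1, the batch
at PTE0 is 2, so the folio would be split. A 4-page folio mapped by 4
consecutive PTEs gives a batch of 4: it is added whole and the remaining 3
PTEs are skipped.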

Thanks
Barry