Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

From: Yin Fengwei
Date: Tue Feb 27 2024 - 03:34:01 EST

On 2/27/24 15:54, Barry Song wrote:
> On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>
>>
>>
>> On 2/27/24 15:21, Barry Song wrote:
>>> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>>>>
>>>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2/27/24 14:40, Barry Song wrote:
>>>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/27/24 10:17, Barry Song wrote:
>>>>>>>>> For example, if we hit a folio that is only partially mapped in the range, don't
>>>>>>>>> split it; just unmap the part that is mapped in the range. Let page reclaim decide
>>>>>>>>> whether to split the large folio or not (if it's not mapped in any other range, it
>>>>>>>>> will be freed as a whole large folio; if part of it is still mapped in another range,
>>>>>>>>> page reclaim can decide whether to split it or ignore it for the current reclaim cycle).
>>>>>>>> Yes, we can. But we still have to play the PTE-checking game to avoid adding
>>>>>>>> folios to the reclaim list multiple times.
>>>>>>>>
>>>>>>>> I don't see much difference between splitting in madvise and splitting
>>>>>>>> in vmscan, as our real purpose is to avoid splitting entirely mapped
>>>>>>>> large folios. For partially mapped large folios, if we split in madvise, then
>>>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
>>>>>>>> If we don't split in madvise, we have to make sure the large folio is
>>>>>>>> added to the reclaim list only once, by checking whether PTEs belong to the
>>>>>>>> previously added folio.
>>>>>>>
>>>>>>> If the partially mapped large folio is unmapped from the range, the related PTEs
>>>>>>> become none. How could the folio be added to the reclaim list multiple times?
>>>>>>
>>>>>> Suppose we have 16 PTEs in a large folio:
>>>>>> PTE0 present
>>>>>> PTE1 present
>>>>>> PTE2 present
>>>>>> PTE3 none
>>>>>> PTE4 present
>>>>>> PTE5 none
>>>>>> PTE6 present
>>>>>> ....
>>>>>> The current code scans PTEs one by one.
>>>>>> While scanning PTE0, we add the folio. Then we hit it again at PTE1, PTE2, PTE4, PTE6...
>>>>> No. Before we detect that the folio is fully mapped to the range, we can't add
>>>>> the folio to the reclaim list, because a partially mapped folio shouldn't be
>>>>> added. Only after scanning PTE15 do we know it's fully mapped.
>>>>
>>>> You never know that PTE15 is the last one mapping the large folio; PTE15 can
>>>> be mapping a completely different folio from the one PTE0 maps.
>>>>
>>>>>
>>>>> So, when scanning PTE0, we will not add the folio. Then when we hit PTE3, we know
>>>>> this is a partially mapped large folio. We will unmap it. Then all 16 PTEs
>>>>> become none.
>>>>
>>>> I don't understand why all 16 PTEs would become none; we don't set PTEs to none.
>>>> PTEs are only replaced with swap entries once try_to_unmap_one is called by vmscan.
>>>>
>>>>>
>>>>> If the large folio is fully mapped, the folio will be added to the reclaim list
>>>>> after we scan PTE15 and know it's fully mapped.
>>>>
>>>> Our approach is to call pte_batch_pte when we meet the first PTE; if
>>>> pte_batch_pte = 16, then we add this folio to reclaim_list and skip the
>>>> remaining 15 PTEs.
>>>
>>> Let's compare the two different implementations for a partially mapped large
>>> folio, with 8 PTEs as below:
>>>
>>> PTE0 present for large folio1
>>> PTE1 present for large folio1
>>> PTE2 present for another folio2
>>> PTE3 present for another folio3
>>> PTE4 present for large folio1
>>> PTE5 present for large folio1
>>> PTE6 present for another folio4
>>> PTE7 present for another folio5
>>>
>>> If we don't split in madvise (depending on vmscan to split after adding
>>> folio1), we will have
>> Let me clarify something here:
>>
>> I prefer that we don't split the large folio here. Instead, we unmap the
>> large folio from this VMA range (I think you missed the unmap operation
>> I mentioned).
>
> I don't understand why we unmap as this is a MADV_PAGEOUT not
> an unmap. unmapping totally changes the semantics. Would you like
> to show pseudo code?
Oh. Yes. MADV_PAGEOUT is not suitable.

What about MADV_FREE?

>
> for MADV_PAGEOUT on swap-out, the last step is writing swap entries
> to replace PTEs which are present. I don't understand how an unmap
> can be involved in this process.
>
>>
>> The intention is to try our best to avoid splitting the large folio. If
>> the folio is only partially mapped to this VMA range, it's likely that it
>> will be reclaimed as a whole large folio, which reduces lru and zone lock
>> contention compared to splitting the large folio.
>
> which also brings negative side effects such as redundant I/O.
> For example, if you have only one subpage left in a large folio,
> pageout will still write nr_pages subpages into swap, then immediately
> free them in swap.
>
>>
>> What I am not sure about is that unmapping from a specific VMA range is
>> not available today, and whether it's worth adding.
>
> I think we might have the possibility to have some complex code to
> add folio1, folio2, folio3, folio4 and folio5 in the above example into
> reclaim_list while avoiding splitting folio1. but i really don't understand
> how unmap will work.
>
>>
>>> to make sure folio1, folio2, folio3, folio4, folio5 are added to
>>> reclaim_list by playing a complex game while scanning these 8 PTEs.
>>>
>>> if we split in madvise, they become:
>>>
>>> PTE0 present for large folioA - split from folio1
>>> PTE1 present for large folioB - split from folio1
>>> PTE2 present for another folio2
>>> PTE3 present for another folio3
>>> PTE4 present for large folioC - split from folio1
>>> PTE5 present for large folioD - split from folio1
>>> PTE6 present for another folio4
>>> PTE7 present for another folio5
>>>
>>> we simply add the above 8 folios into reclaim_list one by one.
>>>
>>> I would vote for splitting partially mapped large folios in madvise.
>>>
>
> Thanks
> Barry
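
For reference, the batching approach discussed above (count the consecutive
PTEs that map the same folio; if the batch covers the whole folio, add it to
the reclaim list once and skip the rest; otherwise split it and add the pieces
one by one) can be modelled in user space roughly as below. This is only a
sketch under stated assumptions, not kernel code: the PTE range is modelled as
an array of folio ids (0 meaning a none PTE), and model_pte_batch, model_scan
and MODEL_MAX_FOLIOS are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model, not kernel code. */
#define MODEL_MAX_FOLIOS 16

/*
 * Count how many consecutive PTEs, starting at idx, map the same folio
 * as pte_folio[idx]. The count is capped at max (the folio's page count).
 */
static size_t model_pte_batch(const int *pte_folio, size_t nr,
                              size_t idx, size_t max)
{
    size_t n = 1;

    while (n < max && idx + n < nr && pte_folio[idx + n] == pte_folio[idx])
        n++;
    return n;
}

/*
 * Walk nr PTEs. pte_folio[i] is the id of the folio mapped by PTE i
 * (0 = none). folio_nr_pages[id] is the page count of folio id.
 * Returns how many entries end up on the modelled reclaim list and
 * stores the number of folios that had to be split in *nr_splits.
 */
static size_t model_scan(const int *pte_folio, size_t nr,
                         const size_t *folio_nr_pages, size_t *nr_splits)
{
    int was_split[MODEL_MAX_FOLIOS] = {0};
    size_t added = 0, i = 0;

    *nr_splits = 0;
    while (i < nr) {
        int id = pte_folio[i];
        size_t pages;

        assert(id >= 0 && id < MODEL_MAX_FOLIOS);
        if (id == 0) {          /* none PTE: nothing to reclaim */
            i++;
            continue;
        }

        pages = folio_nr_pages[id];
        if (!was_split[id] && pages > 1) {
            size_t batch = model_pte_batch(pte_folio, nr, i, pages);

            if (batch == pages) {
                /* Fully mapped here: add once, skip the other PTEs. */
                added++;
                i += batch;
                continue;
            }
            /* Partially mapped: model a split into single pages. */
            was_split[id] = 1;
            (*nr_splits)++;
        }
        /* Small folio, or one piece of a folio split earlier. */
        added++;
        i++;
    }
    return added;
}
```

With the 8-PTE layout above (folio ids 1, 1, 2, 3, 1, 1, 4, 5, and folio1
assumed to have 4 pages), this model splits folio1 once and adds 8 entries.
For 16 PTEs that all map one 16-page folio, it adds a single entry with no
split.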