Re: [PATCH v6 2/3] mm/rmap: integrate PMD-mapped folio splitting into pagewalk loop

From: Lance Yang
Date: Wed Jun 05 2024 - 23:57:22 EST


On Thu, Jun 6, 2024 at 12:16 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 05.06.24 17:43, Lance Yang wrote:
> > On Wed, Jun 5, 2024 at 11:03 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>
> >> On 05.06.24 16:57, Lance Yang wrote:
> >>> On Wed, Jun 5, 2024 at 10:39 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 05.06.24 16:28, David Hildenbrand wrote:
> >>>>> On 05.06.24 16:20, Lance Yang wrote:
> >>>>>> Hi David,
> >>>>>>
> >>>>>> On Wed, Jun 5, 2024 at 8:46 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> On 21.05.24 06:02, Lance Yang wrote:
> >>>>>>>> In preparation for supporting try_to_unmap_one() to unmap PMD-mapped
> >>>>>>>> folios, start the pagewalk first, then call split_huge_pmd_address() to
> >>>>>>>> split the folio.
> >>>>>>>>
> >>>>>>>> Since the TTU_SPLIT_HUGE_PMD split will no longer be performed immediately, we might
> >>>>>>>> encounter a PMD-mapped THP missing the mlock in the VM_LOCKED range during
> >>>>>>>> the page walk. It’s probably necessary to mlock this THP to prevent it from
> >>>>>>>> being picked up during page reclaim.
> >>>>>>>>
> >>>>>>>> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> >>>>>>>> Suggested-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> >>>>>>>> Signed-off-by: Lance Yang <ioworker0@xxxxxxxxx>
> >>>>>>>> ---
> >>>>>>>
> >>>>>>> [...] again, sorry for the late review.
> >>>>>>
> >>>>>> No worries at all, thanks for taking time to review!
> >>>>>>
> >>>>>>>
> >>>>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>>>>>> index ddffa30c79fb..08a93347f283 100644
> >>>>>>>> --- a/mm/rmap.c
> >>>>>>>> +++ b/mm/rmap.c
> >>>>>>>> @@ -1640,9 +1640,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>>>  	if (flags & TTU_SYNC)
> >>>>>>>>  		pvmw.flags = PVMW_SYNC;
> >>>>>>>>
> >>>>>>>> -	if (flags & TTU_SPLIT_HUGE_PMD)
> >>>>>>>> -		split_huge_pmd_address(vma, address, false, folio);
> >>>>>>>> -
> >>>>>>>>  	/*
> >>>>>>>>  	 * For THP, we have to assume the worse case ie pmd for invalidation.
> >>>>>>>>  	 * For hugetlb, it could be much worse if we need to do pud
> >>>>>>>> @@ -1668,20 +1665,35 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>>>>>>>  	mmu_notifier_invalidate_range_start(&range);
> >>>>>>>>
> >>>>>>>>  	while (page_vma_mapped_walk(&pvmw)) {
> >>>>>>>> -		/* Unexpected PMD-mapped THP? */
> >>>>>>>> -		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
> >>>>>>>> -
> >>>>>>>>  		/*
> >>>>>>>>  		 * If the folio is in an mlock()d vma, we must not swap it out.
> >>>>>>>>  		 */
> >>>>>>>>  		if (!(flags & TTU_IGNORE_MLOCK) &&
> >>>>>>>>  		    (vma->vm_flags & VM_LOCKED)) {
> >>>>>>>>  			/* Restore the mlock which got missed */
> >>>>>>>> -			if (!folio_test_large(folio))
> >>>>>>>> +			if (!folio_test_large(folio) ||
> >>>>>>>> +			    (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>>>>  				mlock_vma_folio(folio, vma);
> >>>>>>>
> >>>>>>> Can you elaborate why you think this would be required? If we would have
> >>>>>>> performed the split_huge_pmd_address() beforehand, we would still be
> >>>>>>> left with a large folio, no?
> >>>>>>
> >>>>>> Yep, there would still be a large folio, but it wouldn't be PMD-mapped.
> >>>>>>
> >>>>>> After Weifeng's series[1], the kernel supports mlock for PTE-mapped large
> >>>>>> folios, but there are a few scenarios where we don't mlock a large folio, such
> >>>>>> as when it crosses a VM_LOCKED VMA boundary.
> >>>>>>
> >>>>>> -			if (!folio_test_large(folio))
> >>>>>> +			if (!folio_test_large(folio) ||
> >>>>>> +			    (!pvmw.pte && (flags & TTU_SPLIT_HUGE_PMD)))
> >>>>>>
> >>>>>> And this check is just future-proofing and likely unnecessary. If we encounter a
> >>>>>> PMD-mapped THP missing the mlock for some reason, we can mlock this
> >>>>>> THP to prevent it from being picked up during page reclaim, since it is fully
> >>>>>> mapped and doesn't cross the VMA boundary, IIUC.
> >>>>>>
> >>>>>> What do you think?
> >>>>>> I would appreciate any suggestions regarding this check ;)
> >>>>>
> >>>>> Reading this patch only, I wonder if this change makes sense in the
> >>>>> context here.
> >>>>>
> >>>>> Before this patch, we would have PTE-mapped the PMD-mapped THP before
> >>>>> reaching this call and skipped it due to "!folio_test_large(folio)".
> >>>>>
> >>>>> After this patch, we either
> >>>>>
> >>>>> a) PTE-remap the THP after this check, but retry and end up here again,
> >>>>> whereby we would skip it due to "!folio_test_large(folio)".
> >>>>>
> >>>>> b) Discard the PMD-mapped THP due to lazyfree directly. Can that
> >>>>> co-exist with mlock and what would be the problem here with mlock?
> >>>>>
> >>>>>
> >>>
> >>> Thanks a lot for clarifying!
> >>>
> >>>>> So if the check is required in this patch, we really have to understand
> >>>>> why. If not, we should better drop it from this patch.
> >>>>>
> >>>>> At least that's my opinion; I'm still struggling to understand why it would
> >>>>> be required (I have 0 knowledge about mlock interaction with large folios :) ).
> >>>>>
> >>>>
> >>>> Looking at that series, in folio_referenced_one(), we do
> >>>>
> >>>> 	if (!folio_test_large(folio) || !pvmw.pte) {
> >>>> 		/* Restore the mlock which got missed */
> >>>> 		mlock_vma_folio(folio, vma);
> >>>> 		page_vma_mapped_walk_done(&pvmw);
> >>>> 		pra->vm_flags |= VM_LOCKED;
> >>>> 		return false; /* To break the loop */
> >>>> 	}
> >>>>
> >>>> I wonder if we want that here as well now: in case of lazyfree we
> >>>> would not back off, right?
> >>>>
> >>>> But I'm not sure if lazyfree in mlocked areas is even possible.
> >>>>
> >>>> Adding the "!pvmw.pte" would be much clearer to me than the flag check.
> >>>
> >>> Hmm... How about we drop it from this patch for now, and add it back if needed
> >>> in the future?
> >>
> >> If we can rule out that MADV_FREE + mlock() breaks in the PMD-mapped
> >> case, we're good.
> >>
> >> Can we rule that out? (especially for MADV_FREE followed by mlock())
> >
> > Perhaps we don't need to worry about that.
> >
> > IIUC, without that check, MADV_FREE + mlock() still works as expected in
> > the PMD-mapped case, since if we encounter a large folio in a VM_LOCKED
> > VMA range, we stop the page walk immediately.
>
>
> Can you point me at the code (especially considering patch #3)?

Yep, please see my other mail ;)
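
For reference, the three forms of the "restore the missed mlock" condition
discussed above can be compared with a small userspace model. This is only a
sketch and not kernel code: the function names are made up for illustration,
and the booleans stand in for folio_test_large(folio), pvmw.pte != NULL and
(flags & TTU_SPLIT_HUGE_PMD).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the check before this patch: small folios only. */
bool restore_mlock_old(bool large)
{
	return !large;
}

/* Hypothetical model of the check as posted in v6 of this patch. */
bool restore_mlock_v6(bool large, bool pte_mapped, bool split_huge_pmd)
{
	return !large || (!pte_mapped && split_huge_pmd);
}

/* Hypothetical model of the "!pvmw.pte" form used in folio_referenced_one(). */
bool restore_mlock_pte_only(bool large, bool pte_mapped)
{
	return !large || !pte_mapped;
}

/* Walk all input combinations and check where the variants agree. */
void model_self_check(void)
{
	for (int large = 0; large <= 1; large++) {
		for (int pte = 0; pte <= 1; pte++) {
			/* With the split flag set, v6 and "!pvmw.pte" agree. */
			assert(restore_mlock_v6(large, pte, true) ==
			       restore_mlock_pte_only(large, pte));
			/* For PTE-mapped folios, all three forms agree. */
			if (pte)
				assert(restore_mlock_v6(large, pte, false) ==
				       restore_mlock_old(large));
		}
	}
}
```

In this model the v6 form and the "!pvmw.pte" form differ only for a
PMD-mapped large folio walked without TTU_SPLIT_HUGE_PMD, and both differ
from the old form only when the folio is PMD-mapped.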

Thanks,
Lance

>
> --
> Cheers,
>
> David / dhildenb
>