Re: [PATCH mm-new v3 3/3] mm/khugepaged: merge PTE scanning logic into a new helper

From: Nico Pache

Date: Thu Jun 18 2026 - 09:21:00 EST


On Thu, Jun 18, 2026 at 7:10 AM Lance Yang <lance.yang@xxxxxxxxx> wrote:
>
> +Cc David and Lorenzo, address oops fixed :)
>
> On Thu, Jun 18, 2026 at 06:22:49AM -0600, Nico Pache wrote:
> >On Tue, Oct 14, 2025 at 6:37 AM Lorenzo Stoakes
> ><lorenzo.stoakes@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Oct 08, 2025 at 12:37:48PM +0800, Lance Yang wrote:
> >> > From: Lance Yang <lance.yang@xxxxxxxxx>
> >> >
> >> > As David suggested, the PTE scanning logic in hpage_collapse_scan_pmd()
> >> > and __collapse_huge_page_isolate() was almost duplicated.
> >> >
> >> > This patch cleans things up by moving all the common PTE checking logic
> >> > into a new shared helper, thp_collapse_check_pte(). While at it, we use
> >> > vm_normal_folio() instead of vm_normal_page().
> >> >
> >> > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> >> > Suggested-by: Dev Jain <dev.jain@xxxxxxx>
> >> > Signed-off-by: Lance Yang <lance.yang@xxxxxxxxx>
> >>
> >> In general I like the idea, the implementation needs work though.
> >>
> >> Will check in more detail on respin
> >>
> >> > ---
> >> > mm/khugepaged.c | 243 ++++++++++++++++++++++++++----------------------
> >> > 1 file changed, 130 insertions(+), 113 deletions(-)
> >> >
> >> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> > index b5c0295c3414..7116caae1fa4 100644
> >> > --- a/mm/khugepaged.c
> >> > +++ b/mm/khugepaged.c
> >> > @@ -61,6 +61,12 @@ enum scan_result {
> >> > SCAN_PAGE_FILLED,
> >> > };
> >> >
> >> > +enum pte_check_result {
> >> > + PTE_CHECK_SUCCEED,
> >> > + PTE_CHECK_CONTINUE,
> >>
> >> Don't love this logic - this feels like we're essentially abstracting
> >> control flow, I mean what does 'continue' mean here? Other than you're in a
> >> loop and please continue, which is relying a little too much on what the
> >> caller does.
> >>
> >> if we retain this logic something like PTE_CHECK_SKIP would make more sense.
> >>
> >>
> >> > + PTE_CHECK_FAIL,
> >> > +};
> >> > +
> >> > #define CREATE_TRACE_POINTS
> >> > #include <trace/events/huge_memory.h>
> >> >
> >> > @@ -533,62 +539,139 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte,
> >> > }
> >> > }
> >> >
> >> > +/*
> >> > + * thp_collapse_check_pte - Check if a PTE is suitable for THP collapse
> >> > + * @pte: The PTE to check
> >> > + * @vma: The VMA the PTE belongs to
> >> > + * @addr: The virtual address corresponding to this PTE
> >> > + * @foliop: On success, used to return a pointer to the folio
> >> > + * Must be non-NULL
> >> > + * @none_or_zero: Counter for none/zero PTEs. Must be non-NULL
> >> > + * @unmapped: Counter for swap PTEs. Can be NULL if not scanning swaps
> >> > + * @shared: Counter for shared pages. Must be non-NULL
> >> > + * @scan_result: Used to return the failure reason (SCAN_*) on a
> >> > + * PTE_CHECK_FAIL return. Must be non-NULL
> >> > + * @cc: Collapse control settings
> >>
> >> Do we really need this many parameters? THis is hard to follow.
> >>
> >> Of course it being me, I'd immediately prefer a helper struct :)
> >
> >Hey!
> >
> >Ive been trying to reimplement this patch, and started converting this
> >to a helper struct... At first glance it seems like the right
> >approach, but because the scan/isolate share two slightly different
> >implementations, we need to leave some logic in the parent. This
> >results in polluting the rest of the code significantly just to save
> >one ugly function call with many parameters.
>
> Been a while ... so I don't fully recall anymore :)
>
> If you already have a patch or series, maybe just post it and we can look
> at the actual diff.
>
> Probably easier than talking about it without seeing the code :D

Yeah fair point!

I think I found a good intermediate solution after all. Which was to
keep the folio local (not part of the helper), as that was the major
source of the noise.

I have a work-in-progress series to address a number of cleanups, the
code should be ready at the start of next week. I just want to test
and verify things first, and it's best not to post at the end of the
week :)

-- Nico

>
> Cheers, Lance
>
> >
> >not sure what to do about that tbh.
> >
> >e.g., (just one of many examples)
> >
> >
> >- NR_ISOLATED_ANON + folio_is_file_lru(folio),
> >- folio_nr_pages(folio));
> >- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> >- VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> >+ NR_ISOLATED_ANON + folio_is_file_lru(ctx.folio),
> >+ folio_nr_pages(ctx.folio));
> >+ VM_BUG_ON_FOLIO(!folio_test_locked(ctx.folio), ctx.folio);
> >+ VM_BUG_ON_FOLIO(folio_test_lru(ctx.folio), ctx.folio);
> >
> >- if (folio_test_large(folio))
> >- list_add_tail(&folio->lru, compound_pagelist);
> >+ if (folio_test_large(ctx.folio))
> >+ list_add_tail(ctx.folio->lru, compound_pagelist);
> >
> >Are we sure this helper structure is the right approach?
> >
> >-- Nico
> >
> >>
> >> > + *
> >> > + * Returns:
> >> > + * PTE_CHECK_SUCCEED - PTE is suitable, proceed with further checks
> >> > + * PTE_CHECK_CONTINUE - Skip this PTE and continue scanning
> >> > + * PTE_CHECK_FAIL - Abort collapse scan
> >> > + */
> >> > +static inline int thp_collapse_check_pte(pte_t pte, struct vm_area_struct *vma,
> >>
> >> There's no need for inline in an internal static function in a file.
> >>
> >> > + unsigned long addr, struct folio **foliop, int *none_or_zero,
> >> > + int *unmapped, int *shared, int *scan_result,
> >> > + struct collapse_control *cc)
> >> > +{
> >> > + struct folio *folio = NULL;
> >> > +
> >> > + if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
> >> > + (*none_or_zero)++;
> >> > + if (!userfaultfd_armed(vma) &&
> >> > + (!cc->is_khugepaged ||
> >> > + *none_or_zero <= khugepaged_max_ptes_none)) {
> >> > + return PTE_CHECK_CONTINUE;
> >> > + } else {
> >> > + *scan_result = SCAN_EXCEED_NONE_PTE;
> >> > + count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >> > + return PTE_CHECK_FAIL;
> >> > + }
> >> > + } else if (!pte_present(pte)) {
> >>
> >> You're now making the if-else issues with previous patches worse by
> >> returning which gets even checkpatch to warn you.
> >>
> >> if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
> >>
> >> (of course note that I am not convinced you can drop the pte_present()
> >> check here)
> >>
> >> (*none_or_zero)++;
> >> if (!userfaultfd_armed(vma) &&
> >> (!cc->is_khugepaged ||
> >> *none_or_zero <= khugepaged_max_ptes_none))
> >> return PTE_CHECK_CONTINUE;
> >>
> >> *scan_result = SCAN_EXCEED_NONE_PTE;
> >> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >> return PTE_CHECK_FAIL;
> >> }
> >>
> >> if (!pte_present(pte)) {
> >> ...
> >> }
> >>
> >> But even that really needs seriously more refactoring - that condition
> >> above is horrible for instance so, e.g.:
> >>
> >> if (pte_none(pte) || is_zero_pfn(pte_pfn(pte))) {
> >> (*none_or_zero)++;
> >>
> >> /* Cases where this is acceptable. */
> >> if (!userfaultfd_armed(vma))
> >> return PTE_CHECK_SKIP;
> >> if (!cc->is_khugepaged)
> >> return PTE_CHECK_SKIP;
> >> if (*none_or_zero <= khugepaged_max_ptes_none)
> >> return PTE_CHECK_SKIP;
> >>
> >> /* Otherwise, we must abort the scan. */
> >> *scan_result = SCAN_EXCEED_NONE_PTE;
> >> count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >> return PTE_CHECK_FAIL;
> >> }
> >>
> >> if (!pte_present(pte)) {
> >> ...
> >> }
> >>
> >> Improves things a lot.
> >>
> >> I do however _hate_ this (*blah)++; thing. A helper struct would help :)
> >>
> >>
> >>
> >> > + if (!unmapped) {
> >> > + *scan_result = SCAN_PTE_NON_PRESENT;
> >> > + return PTE_CHECK_FAIL;
> >>
> >> Can't we have a helper that sets the result, e.g.:
> >>
> >> return pte_check_fail(scan_result, SCAN_PTE_NON_PRESENT);
> >>
> >> static enum pte_check_result pte_check_fail(int *scan_result,
> >> enum pte_check_result result)
> >> {
> >> *scan_result = result;
> >> return PTE_CHECK_FAIL;
> >> }
> >>
> >> And same for success/skip
> >>
> >> Then all of these horrible if (blah) { *foo = bar; return baz; } can be
> >> made into
> >>
> >> if (blah)
> >> return pte_check_xxx(..., SCAN_PTE_...);
> >>
> >> > + }
> >> > +
> >> > + if (non_swap_entry(pte_to_swp_entry(pte))) {
> >> > + *scan_result = SCAN_PTE_NON_PRESENT;
> >> > + return PTE_CHECK_FAIL;
> >> > + }
> >> > +
> >> > + (*unmapped)++;
> >> > + if (!cc->is_khugepaged ||
> >> > + *unmapped <= khugepaged_max_ptes_swap) {
> >> > + /*
> >> > + * Always be strict with uffd-wp enabled swap
> >> > + * entries. Please see comment below for
> >> > + * pte_uffd_wp().
> >> > + */
> >> > + if (pte_swp_uffd_wp(pte)) {
> >> > + *scan_result = SCAN_PTE_UFFD_WP;
> >> > + return PTE_CHECK_FAIL;
> >> > + }
> >> > + return PTE_CHECK_CONTINUE;
> >> > + } else {
> >> > + *scan_result = SCAN_EXCEED_SWAP_PTE;
> >> > + count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> >> > + return PTE_CHECK_FAIL;
> >> > + }
> >> > + } else if (pte_uffd_wp(pte)) {
> >> > + /*
> >> > + * Don't collapse the page if any of the small PTEs are
> >> > + * armed with uffd write protection. Here we can also mark
> >> > + * the new huge pmd as write protected if any of the small
> >> > + * ones is marked but that could bring unknown userfault
> >> > + * messages that falls outside of the registered range.
> >> > + * So, just be simple.
> >> > + */
> >> > + *scan_result = SCAN_PTE_UFFD_WP;
> >> > + return PTE_CHECK_FAIL;
> >> > + }
> >> > +
> >>
> >> Again as all of the comments for previous series still apply here so
> >> obviously I don't like this as-is :)
> >>
> >> And as Andrew has pointed out, now checkpatch is finding the 'pointless
> >> else' issue (as mentioned above also).
> >>
> >> > + folio = vm_normal_folio(vma, addr, pte);
> >> > + if (unlikely(!folio) || unlikely(folio_is_zone_device(folio))) {
> >> > + *scan_result = SCAN_PAGE_NULL;
> >> > + return PTE_CHECK_FAIL;
> >> > + }
> >> > +
> >> > + if (!folio_test_anon(folio)) {
> >> > + VM_WARN_ON_FOLIO(true, folio);
> >> > + *scan_result = SCAN_PAGE_ANON;
> >> > + return PTE_CHECK_FAIL;
> >> > + }
> >> > +
> >> > + /*
> >> > + * We treat a single page as shared if any part of the THP
> >> > + * is shared.
> >> > + */
> >> > + if (folio_maybe_mapped_shared(folio)) {
> >> > + (*shared)++;
> >> > + if (cc->is_khugepaged && *shared > khugepaged_max_ptes_shared) {
> >> > + *scan_result = SCAN_EXCEED_SHARED_PTE;
> >> > + count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >> > + return PTE_CHECK_FAIL;
> >> > + }
> >> > + }
> >> > +
> >> > + *foliop = folio;
> >> > +
> >> > + return PTE_CHECK_SUCCEED;
> >> > +}
> >> > +
> >> > static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >> > unsigned long start_addr,
> >> > pte_t *pte,
> >> > struct collapse_control *cc,
> >> > struct list_head *compound_pagelist)
> >> > {
> >> > - struct page *page = NULL;
> >> > struct folio *folio = NULL;
> >> > unsigned long addr = start_addr;
> >> > pte_t *_pte;
> >> > int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> >> > + int pte_check_res;
> >> >
> >> > for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> >> > _pte++, addr += PAGE_SIZE) {
> >> > pte_t pteval = ptep_get(_pte);
> >> > - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> >> > - ++none_or_zero;
> >> > - if (!userfaultfd_armed(vma) &&
> >> > - (!cc->is_khugepaged ||
> >> > - none_or_zero <= khugepaged_max_ptes_none)) {
> >> > - continue;
> >> > - } else {
> >> > - result = SCAN_EXCEED_NONE_PTE;
> >> > - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >> > - goto out;
> >> > - }
> >> > - } else if (!pte_present(pteval)) {
> >> > - result = SCAN_PTE_NON_PRESENT;
> >> > - goto out;
> >> > - } else if (pte_uffd_wp(pteval)) {
> >> > - result = SCAN_PTE_UFFD_WP;
> >> > - goto out;
> >> > - }
> >> > - page = vm_normal_page(vma, addr, pteval);
> >> > - if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> >> > - result = SCAN_PAGE_NULL;
> >> > - goto out;
> >> > - }
> >> >
> >> > - folio = page_folio(page);
> >> > - if (!folio_test_anon(folio)) {
> >> > - VM_WARN_ON_FOLIO(true, folio);
> >> > - result = SCAN_PAGE_ANON;
> >> > - goto out;
> >> > - }
> >> > + pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> >> > + &folio, &none_or_zero, NULL, &shared,
> >> > + &result, cc);
> >> >
> >> > - /* See hpage_collapse_scan_pmd(). */
> >> > - if (folio_maybe_mapped_shared(folio)) {
> >> > - ++shared;
> >> > - if (cc->is_khugepaged &&
> >> > - shared > khugepaged_max_ptes_shared) {
> >> > - result = SCAN_EXCEED_SHARED_PTE;
> >> > - count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >> > - goto out;
> >> > - }
> >> > - }
> >> > + if (pte_check_res == PTE_CHECK_CONTINUE)
> >> > + continue;
> >> > + else if (pte_check_res == PTE_CHECK_FAIL)
> >> > + goto out;
> >> >
> >> > if (folio_test_large(folio)) {
> >> > struct folio *f;
> >> > @@ -1264,11 +1347,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >> > pte_t *pte, *_pte;
> >> > int result = SCAN_FAIL, referenced = 0;
> >> > int none_or_zero = 0, shared = 0;
> >> > - struct page *page = NULL;
> >> > struct folio *folio = NULL;
> >> > unsigned long addr;
> >> > spinlock_t *ptl;
> >> > int node = NUMA_NO_NODE, unmapped = 0;
> >> > + int pte_check_res;
> >> >
> >> > VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
> >> >
> >> > @@ -1287,81 +1370,15 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> >> > for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >> > _pte++, addr += PAGE_SIZE) {
> >> > pte_t pteval = ptep_get(_pte);
> >> > - if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> >> > - ++none_or_zero;
> >> > - if (!userfaultfd_armed(vma) &&
> >> > - (!cc->is_khugepaged ||
> >> > - none_or_zero <= khugepaged_max_ptes_none)) {
> >> > - continue;
> >> > - } else {
> >> > - result = SCAN_EXCEED_NONE_PTE;
> >> > - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> >> > - goto out_unmap;
> >> > - }
> >> > - } else if (!pte_present(pteval)) {
> >> > - if (non_swap_entry(pte_to_swp_entry(pteval))) {
> >> > - result = SCAN_PTE_NON_PRESENT;
> >> > - goto out_unmap;
> >> > - }
> >> >
> >> > - ++unmapped;
> >> > - if (!cc->is_khugepaged ||
> >> > - unmapped <= khugepaged_max_ptes_swap) {
> >> > - /*
> >> > - * Always be strict with uffd-wp
> >> > - * enabled swap entries. Please see
> >> > - * comment below for pte_uffd_wp().
> >> > - */
> >> > - if (pte_swp_uffd_wp(pteval)) {
> >> > - result = SCAN_PTE_UFFD_WP;
> >> > - goto out_unmap;
> >> > - }
> >> > - continue;
> >> > - } else {
> >> > - result = SCAN_EXCEED_SWAP_PTE;
> >> > - count_vm_event(THP_SCAN_EXCEED_SWAP_PTE);
> >> > - goto out_unmap;
> >> > - }
> >> > - } else if (pte_uffd_wp(pteval)) {
> >> > - /*
> >> > - * Don't collapse the page if any of the small
> >> > - * PTEs are armed with uffd write protection.
> >> > - * Here we can also mark the new huge pmd as
> >> > - * write protected if any of the small ones is
> >> > - * marked but that could bring unknown
> >> > - * userfault messages that falls outside of
> >> > - * the registered range. So, just be simple.
> >> > - */
> >> > - result = SCAN_PTE_UFFD_WP;
> >> > - goto out_unmap;
> >> > - }
> >> > + pte_check_res = thp_collapse_check_pte(pteval, vma, addr,
> >> > + &folio, &none_or_zero, &unmapped,
> >> > + &shared, &result, cc);
> >> >
> >> > - page = vm_normal_page(vma, addr, pteval);
> >> > - if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
> >> > - result = SCAN_PAGE_NULL;
> >> > - goto out_unmap;
> >> > - }
> >> > - folio = page_folio(page);
> >> > -
> >> > - if (!folio_test_anon(folio)) {
> >> > - VM_WARN_ON_FOLIO(true, folio);
> >> > - result = SCAN_PAGE_ANON;
> >> > + if (pte_check_res == PTE_CHECK_CONTINUE)
> >> > + continue;
> >> > + else if (pte_check_res == PTE_CHECK_FAIL)
> >> > goto out_unmap;
> >> > - }
> >> > -
> >> > - /*
> >> > - * We treat a single page as shared if any part of the THP
> >> > - * is shared.
> >> > - */
> >> > - if (folio_maybe_mapped_shared(folio)) {
> >> > - ++shared;
> >> > - if (cc->is_khugepaged &&
> >> > - shared > khugepaged_max_ptes_shared) {
> >> > - result = SCAN_EXCEED_SHARED_PTE;
> >> > - count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> >> > - goto out_unmap;
> >> > - }
> >> > - }
> >> >
> >> > /*
> >> > * Record which node the original page is from and save this
> >> > --
> >> > 2.49.0
> >> >
> >>
> >
> >
>