Re: [PATCH mm-new v8 2/4] mm: khugepaged: refine scan progress number

From: Vernon Yang

Date: Thu Feb 26 2026 - 09:44:05 EST

On Wed, Feb 25, 2026 at 03:29:05PM +0100, David Hildenbrand (Arm) wrote:
> On 2/25/26 15:25, Vernon Yang wrote:
> > On Tue, Feb 24, 2026 at 03:52:47AM +0000, Wei Yang wrote:
> >> On Sat, Feb 21, 2026 at 05:39:16PM +0800, Vernon Yang wrote:
> >>> From: Vernon Yang <yanglincheng@xxxxxxxxxx>
> >>>
> >>> Currently, each scan always increases "progress" by HPAGE_PMD_NR,
> >>> even if only scanning a single PTE/PMD entry.
> >>>
> >>> - When only scanning a sigle PTE entry, let me provide a detailed
> >>> example:
> >>>
> >>> static int hpage_collapse_scan_pmd()
> >>> {
> >>> for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> >>> _pte++, addr += PAGE_SIZE) {
> >>> pte_t pteval = ptep_get(_pte);
> >>> ...
> >>> if (pte_uffd_wp(pteval)) { <-- first scan hit
> >>> result = SCAN_PTE_UFFD_WP;
> >>> goto out_unmap;
> >>> }
> >>> }
> >>> }
> >>>
> >>> During the first scan, if pte_uffd_wp(pteval) is true, the loop exits
> >>> directly. In practice, only one PTE is scanned before termination.
> >>> Here, "progress += 1" reflects the actual number of PTEs scanned, but
> >>> previously "progress += HPAGE_PMD_NR" always.
> >>>
> >>> - When the memory has been collapsed to PMD, let me provide a detailed
> >>> example:
> >>>
> >>> The following data is traced by bpftrace on a desktop system. After
> >>> the system has been left idle for 10 minutes upon booting, a lot of
> >>> SCAN_PMD_MAPPED or SCAN_NO_PTE_TABLE are observed during a full scan
> >>> by khugepaged.
> >>>
> >>> >From trace_mm_khugepaged_scan_pmd and trace_mm_khugepaged_scan_file, the
> >>> following statuses were observed, with frequency mentioned next to them:
> >>>
> >>> SCAN_SUCCEED : 1
> >>> SCAN_EXCEED_SHARED_PTE: 2
> >>> SCAN_PMD_MAPPED : 142
> >>> SCAN_NO_PTE_TABLE : 178
> >>> total progress size : 674 MB
> >>> Total time : 419 seconds, include khugepaged_scan_sleep_millisecs
> >>>
> >>> The khugepaged_scan list save all task that support collapse into hugepage,
> >>> as long as the task is not destroyed, khugepaged will not remove it from
> >>> the khugepaged_scan list. This exist a phenomenon where task has already
> >>> collapsed all memory regions into hugepage, but khugepaged continues to
> >>> scan it, which wastes CPU time and invalid, and due to
> >>> khugepaged_scan_sleep_millisecs (default 10s) causes a long wait for
> >>> scanning a large number of invalid task, so scanning really valid task
> >>> is later.
> >>>
> >>> After applying this patch, when the memory is either SCAN_PMD_MAPPED or
> >>> SCAN_NO_PTE_TABLE, just skip it, as follow:
> >>>
> >>> SCAN_EXCEED_SHARED_PTE: 2
> >>> SCAN_PMD_MAPPED : 147
> >>> SCAN_NO_PTE_TABLE : 173
> >>> total progress size : 45 MB
> >>> Total time : 20 seconds
> >>>
> >>> SCAN_PTE_MAPPED_HUGEPAGE is the same, for detailed data, refer to
> >>> https://lore.kernel.org/linux-mm/4qdu7owpmxfh3ugsue775fxarw5g2gcggbxdf5psj75nnu7z2u@cv2uu2yocaxq
> >>>
> >>> Signed-off-by: Vernon Yang <yanglincheng@xxxxxxxxxx>
> >>> Reviewed-by: Dev Jain <dev.jain@xxxxxxx>
> >>> ---
> >>> mm/khugepaged.c | 42 ++++++++++++++++++++++++++++++++----------
> >>> 1 file changed, 32 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index e2f6b68a0011..61e25cf5424b 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -68,7 +68,10 @@ enum scan_result {
> >>> static struct task_struct *khugepaged_thread __read_mostly;
> >>> static DEFINE_MUTEX(khugepaged_mutex);
> >>>
> >>> -/* default scan 8*HPAGE_PMD_NR ptes (or vmas) every 10 second */
> >>> +/*
> >>> + * default scan 8*HPAGE_PMD_NR ptes, pmd_mapped, no_pte_table or vmas
> >>> + * every 10 second.
> >>> + */
> >>> static unsigned int khugepaged_pages_to_scan __read_mostly;
> >>> static unsigned int khugepaged_pages_collapsed;
> >>> static unsigned int khugepaged_full_scans;
> >>> @@ -1231,7 +1234,8 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
> >>> }
> >>>
> >>> static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>> - struct vm_area_struct *vma, unsigned long start_addr, bool *mmap_locked,
> >>> + struct vm_area_struct *vma, unsigned long start_addr,
> >>> + bool *mmap_locked, unsigned int *cur_progress,
> >>> struct collapse_control *cc)
> >>> {
> >>> pmd_t *pmd;
> >>> @@ -1247,19 +1251,27 @@ static enum scan_result hpage_collapse_scan_pmd(struct mm_struct *mm,
> >>> VM_BUG_ON(start_addr & ~HPAGE_PMD_MASK);
> >>>
> >>> result = find_pmd_or_thp_or_none(mm, start_addr, &pmd);
> >>> - if (result != SCAN_SUCCEED)
> >>> + if (result != SCAN_SUCCEED) {
> >>> + if (cur_progress)
> >>> + *cur_progress = 1;
> >>> goto out;
> >>> + }
> >>
> >> How about put cur_progress in struct collapse_control?
> >>
> >> Then we don't need to check cur_progress every time before modification.
> >
> > Thank you for suggestion.
> >
> > Placing it inside "struct collapse_control" makes the overall code
> > simpler, there also coincidentally has a 4-bytes hole, as shown below:
> >
> > struct collapse_control {
> > bool is_khugepaged; /* 0 1 */
> >
> > /* XXX 3 bytes hole, try to pack */
> >
> > u32 node_load[64]; /* 4 256 */
> >
> > /* XXX 4 bytes hole, try to pack */
> >
> > /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
> > nodemask_t alloc_nmask; /* 264 8 */
> >
> > /* size: 272, cachelines: 5, members: 3 */
> > /* sum members: 265, holes: 2, sum holes: 7 */
> > /* last cacheline: 16 bytes */
> > };
> >
> > But regardless of khugepaged or madvise(MADV_COLLAPSE), "cur_progress"
> > will be counted, while madvise(MADV_COLLAPSE) actually does not need to
> > be counted.
> >
> > David, do we want to place "cur_progress" inside the "struct collapse_control"?
>
> Might end up looking nicer code-wise. But the reset semantics (within a
> pmd) are a bit weird.
>
> > If Yes, it would be better to rename "cur_progress" to "pmd_progress",
> > as show below:
> >
>
> "pmd_progress" is misleading. "progress_in_pmd" might be clearer.
>
> Play with it to see if it looks better :)

Hi Andrew, David,

Based on previous discussions [1], v2 as follow, and testing shows the
same performance benefits. Just make code cleaner, no function changes.

If David has no further revisions, Andrew, could you please squash the
following clean into this patch? If you prefer a new version, please let
me know. Thanks.

[1] https://lore.kernel.org/linux-mm/zdvzmoop5xswqcyiwmvvrdfianm4ccs3gryfecwbm4bhuh7ebo@7an4huwgbuwo

---