Re: [PATCH mm-unstable v15 07/13] mm/khugepaged: add per-order mTHP collapse failure statistics

From: Lorenzo Stoakes (Oracle)

Date: Tue Mar 17 2026 - 13:07:38 EST

On Wed, Feb 25, 2026 at 08:25:04PM -0700, Nico Pache wrote:
> Add three new mTHP statistics to track collapse failures for different
> orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
>
> - collapse_exceed_swap_pte: Increment when mTHP collapse fails due to swap
> PTEs
>
> - collapse_exceed_none_pte: Counts when mTHP collapse fails due to
> exceeding the none PTE threshold for the given order
>
> - collapse_exceed_shared_pte: Counts when mTHP collapse fails due to shared
> PTEs
>
> These statistics complement the existing THP_SCAN_EXCEED_* events by
> providing per-order granularity for mTHP collapse attempts. The stats are
> exposed via sysfs under
> `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each
> supported hugepage size.
>
> As we currently dont support collapsing mTHPs that contain a swap or
> shared entry, those statistics keep track of how often we are
> encountering failed mTHP collapses due to these restrictions.
>
> Reviewed-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Nico Pache <npache@xxxxxxxxxx>
> ---
> Documentation/admin-guide/mm/transhuge.rst | 24 ++++++++++++++++++++++
> include/linux/huge_mm.h | 3 +++
> mm/huge_memory.c | 7 +++++++
> mm/khugepaged.c | 16 ++++++++++++---
> 4 files changed, 47 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index c51932e6275d..eebb1f6bbc6c 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -714,6 +714,30 @@ nr_anon_partially_mapped
> an anonymous THP as "partially mapped" and count it here, even though it
> is not actually partially mapped anymore.
>
> +collapse_exceed_none_pte
> + The number of collapse attempts that failed due to exceeding the
> + max_ptes_none threshold. For mTHP collapse, Currently only max_ptes_none
> + values of 0 and (HPAGE_PMD_NR - 1) are supported. Any other value will
> + emit a warning and no mTHP collapse will be attempted. khugepaged will

It's weird to document this here but not elsewhere in the document? I mean I
made this comment on the documentation patch also.

Not sure if I missed you adding it to another bit of the docs? :)

> + try to collapse to the largest enabled (m)THP size; if it fails, it will
> + try the next lower enabled mTHP size. This counter records the number of
> + times a collapse attempt was skipped for exceeding the max_ptes_none
> + threshold, and khugepaged will move on to the next available mTHP size.
> +
> +collapse_exceed_swap_pte
> + The number of anonymous mTHP PTE ranges which were unable to collapse due
> + to containing at least one swap PTE. Currently khugepaged does not
> + support collapsing mTHP regions that contain a swap PTE. This counter can
> + be used to monitor the number of khugepaged mTHP collapses that failed
> + due to the presence of a swap PTE.
> +
> +collapse_exceed_shared_pte
> + The number of anonymous mTHP PTE ranges which were unable to collapse due
> + to containing at least one shared PTE. Currently khugepaged does not
> + support collapsing mTHP PTE ranges that contain a shared PTE. This
> + counter can be used to monitor the number of khugepaged mTHP collapses
> + that failed due to the presence of a shared PTE.

All of these talk about 'ranges' that could be of any size. Are these useful
metrics? Counting a bunch of failures and not knowing if they are 256 KB
failures or 16 KB failures or whatever is maybe not so useful information?

Also, from the code, aren't you treating PMD events the same as mTHP ones from
the point of view of these counters? Maybe worth documenting that?

> +
> As the system ages, allocating huge pages may be expensive as the
> system uses memory compaction to copy data around memory to free a
> huge page for use. There are some counters in ``/proc/vmstat`` to help
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 9941fc6d7bd8..e8777bb2347d 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -144,6 +144,9 @@ enum mthp_stat_item {
> MTHP_STAT_SPLIT_DEFERRED,
> MTHP_STAT_NR_ANON,
> MTHP_STAT_NR_ANON_PARTIALLY_MAPPED,
> + MTHP_STAT_COLLAPSE_EXCEED_SWAP,
> + MTHP_STAT_COLLAPSE_EXCEED_NONE,
> + MTHP_STAT_COLLAPSE_EXCEED_SHARED,
> __MTHP_STAT_COUNT
> };
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 228f35e962b9..1049a207a257 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -642,6 +642,10 @@ DEFINE_MTHP_STAT_ATTR(split_failed, MTHP_STAT_SPLIT_FAILED);
> DEFINE_MTHP_STAT_ATTR(split_deferred, MTHP_STAT_SPLIT_DEFERRED);
> DEFINE_MTHP_STAT_ATTR(nr_anon, MTHP_STAT_NR_ANON);
> DEFINE_MTHP_STAT_ATTR(nr_anon_partially_mapped, MTHP_STAT_NR_ANON_PARTIALLY_MAPPED);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_swap_pte, MTHP_STAT_COLLAPSE_EXCEED_SWAP);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_none_pte, MTHP_STAT_COLLAPSE_EXCEED_NONE);
> +DEFINE_MTHP_STAT_ATTR(collapse_exceed_shared_pte, MTHP_STAT_COLLAPSE_EXCEED_SHARED);

Is there a reason there's such a difference between the names and the actual
enum names?

> +
>
> static struct attribute *anon_stats_attrs[] = {
> &anon_fault_alloc_attr.attr,
> @@ -658,6 +662,9 @@ static struct attribute *anon_stats_attrs[] = {
> &split_deferred_attr.attr,
> &nr_anon_attr.attr,
> &nr_anon_partially_mapped_attr.attr,
> + &collapse_exceed_swap_pte_attr.attr,
> + &collapse_exceed_none_pte_attr.attr,
> + &collapse_exceed_shared_pte_attr.attr,
> NULL,
> };
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c739f26dd61e..a6cf90e09e4a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -595,7 +595,9 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> continue;
> } else {
> result = SCAN_EXCEED_NONE_PTE;
> - count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> + if (is_pmd_order(order))
> + count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_NONE);

It's a bit gross to have separate stats for both thp and mthp but maybe
unavoidable from a legacy stand point.

Why are we dropping the _PTE suffix?

> goto out;
> }
> }
> @@ -631,10 +633,17 @@ static enum scan_result __collapse_huge_page_isolate(struct vm_area_struct *vma,
> * shared may cause a future higher order collapse on a
> * rescan of the same range.
> */
> - if (!is_pmd_order(order) || (cc->is_khugepaged &&
> - shared > khugepaged_max_ptes_shared)) {

OK losing track here :) as the series sadly doesn't currently apply so can't
browser file as is.

In the code I'm looking at, there's also a ++shared here that I guess another
patch removed?

Is this in the folio_maybe_mapped_shared() branch?

> + if (!is_pmd_order(order)) {
> + result = SCAN_EXCEED_SHARED_PTE;
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> + goto out;
> + }
> +
> + if (cc->is_khugepaged &&
> + shared > khugepaged_max_ptes_shared) {
> result = SCAN_EXCEED_SHARED_PTE;
> count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SHARED);
> goto out;

Anyway I'm a bit lost on this logic until a respin but this looks like a LOT of
code duplication. I see David alluded to a refactoring so maybe what he suggests
will help (not had a chance to check what it is specifically :P)

> }
> }
> @@ -1081,6 +1090,7 @@ static enum scan_result __collapse_huge_page_swapin(struct mm_struct *mm,
> * range.
> */
> if (!is_pmd_order(order)) {
> + count_mthp_stat(order, MTHP_STAT_COLLAPSE_EXCEED_SWAP);

Hmm I thought we were incrementing mthp stats for pmd sized also?

> pte_unmap(pte);
> mmap_read_unlock(mm);
> result = SCAN_EXCEED_SWAP_PTE;
> --
> 2.53.0
>

Cheers, Lorenzo