Re: [PATCH mm-unstable v18 11/14] mm/khugepaged: Introduce mTHP collapse support

From: Nico Pache

Date: Tue Jun 02 2026 - 13:28:48 EST


On Tue, Jun 2, 2026 at 11:22 AM Nico Pache <npache@xxxxxxxxxx> wrote:
>
>
>
> On 6/1/26 7:15 AM, David Hildenbrand (Arm) wrote:
> >>>
> >>> Reading this, it is unclear why exactly do we need the stack.
> >>
> >> So I looked into your items below. It seems logical, and I think it
> >> works the same way; however, your method seems slightly harder to
> >> understand due to all the edge cases and more error-prone to future
> >> changes (the stack holds implicit knowledge of the offset/order that
> >> must now be tracked in the edge cases).
> >>
> >> Given the stack is 24 bytes, I'm not sure if the extra complexity is
> >> worth saving that small amount of memory. Although we would also be
> >> getting rid of (3?) functions, so both approaches have pros and cons.
> >
> > I consider a simple forward loop over the offset ... less complexity compared to
> > a stack structure :)
> >
> >>
> >> I will implement a patch comparing your solution against mine and send
> >> it here, then we can decide which approach is better.
> >
> > Right, throw it over the fence and I'll see how to improve it further.
>
> Ok heres what the diff looks like on top of my V19.
>
> you can access the tree here https://gitlab.com/npache/linux/-/commits/mthp-v19?ref_type=heads for easier review.
>
> So far I have no problem with this approach it appeared cleaner than i thought. Did some light testing. Gonna throw it more through the ringer tomorrow.

not sure why this didnt send with the proper encoding I guess my email
is still a little screwed up

>
>
> From 9496c5d17eba7f6d04820d78c7c6f1592a58888a Mon Sep 17 00:00:00 2001
> From: Nico Pache <npache@xxxxxxxxxx>
> Date: Tue, 2 Jun 2026 10:26:18 -0600
> Subject: [PATCH] convert from stack to forward loop
>
> Signed-off-by: Nico Pache <npache@xxxxxxxxxx>
> ---
> mm/khugepaged.c | 96 ++++++++-----------------------------------------
> 1 file changed, 15 insertions(+), 81 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 498eba009751..6de935e76ceb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -100,28 +100,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> #define KHUGEPAGED_MIN_MTHP_ORDER 2
> -/*
> - * mthp_collapse() does an iterative DFS over a binary tree, from
> - * HPAGE_PMD_ORDER down to KHUGEPAGED_MIN_MTHP_ORDER. The max stack
> - * size needed for a DFS on a binary tree is height + 1, where
> - * height = HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER.
> - *
> - * ilog2 is used in place of HPAGE_PMD_ORDER because some architectures
> - * (e.g. ppc64le) do not define HPAGE_PMD_ORDER until after build time.
> - */
> -#define MTHP_STACK_SIZE (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER + 1)
> -
> -/*
> - * Defines a range of PTE entries in a PTE page table which are being
> - * considered for mTHP collapse.
> - *
> - * @offset: the offset of the first PTE entry in a PMD range.
> - * @order: the order of the PTE entries being considered for collapse.
> - */
> -struct mthp_range {
> - u16 offset;
> - u8 order;
> -};
>
> struct collapse_control {
> bool is_khugepaged;
> @@ -137,7 +115,6 @@ struct collapse_control {
>
> /* Each bit represents a single occupied (!none/zero) page. */
> DECLARE_BITMAP(mthp_present_ptes, MAX_PTRS_PER_PTE);
> - struct mthp_range mthp_bitmap_stack[MTHP_STACK_SIZE];
> };
>
> /**
> @@ -1458,50 +1435,14 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long s
> return result;
> }
>
> -static void collapse_mthp_stack_push(struct collapse_control *cc, int *stack_size,
> - u16 offset, u8 order)
> -{
> - const int size = *stack_size;
> - struct mthp_range *stack = &cc->mthp_bitmap_stack[size];
> -
> - VM_WARN_ON_ONCE(size >= MTHP_STACK_SIZE);
> - stack->order = order;
> - stack->offset = offset;
> - (*stack_size)++;
> -}
> -
> -static struct mthp_range collapse_mthp_stack_pop(struct collapse_control *cc,
> - int *stack_size)
> -{
> - const int size = *stack_size;
> -
> - VM_WARN_ON_ONCE(size <= 0);
> - (*stack_size)--;
> - return cc->mthp_bitmap_stack[size - 1];
> -}
> -
> /*
> * mthp_collapse() consumes the bitmap that is generated during
> * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> *
> * Each bit in cc->mthp_present_ptes represents a single occupied (!none/zero)
> - * page. A stack structure cc->mthp_bitmap_stack is used to check different
> - * regions of the bitmap for collapse eligibility. The stack maintains a pair
> - * of variables (offset, order), indicating the number of PTEs from the start
> - * of the PMD, and the order of the potential collapse candidate respectively.
> - * We start at the PMD order and check if it is eligible for collapse; if not,
> - * we add two entries to the stack at a lower order to represent the left and
> - * right halves of the PTE page table we are examining.
> - *
> - * offset mid_offset
> - * | |
> - * | |
> - * v v
> - * --------------------------------------
> - * | cc->mthp_present_ptes |
> - * --------------------------------------
> - * <-------><------->
> - * order-1 order-1
> + * page. We start at the PMD order and check if it is eligible for collapse;
> + * if not, we check the left and right halves of the PTE page table we are
> + * examining at a lower order.
> *
> * For each of these, we determine how many PTE entries are occupied in the
> * range of PTE entries we propose to collapse, then we compare this to a
> @@ -1517,26 +1458,20 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> {
> unsigned int nr_occupied_ptes, nr_ptes, max_ptes_none;
> enum scan_result last_result = SCAN_FAIL;
> - int collapsed = 0, stack_size = 0;
> + int collapsed = 0;
> bool alloc_failed = false;
> unsigned long collapse_address;
> - struct mthp_range range;
> - u16 offset;
> - u8 order;
> + unsigned int offset = 0;
> + unsigned int order = HPAGE_PMD_ORDER;
>
> - collapse_mthp_stack_push(cc, &stack_size, 0, HPAGE_PMD_ORDER);
>
> - while (stack_size) {
> - range = collapse_mthp_stack_pop(cc, &stack_size);
> - order = range.order;
> - offset = range.offset;
> + while (offset < HPAGE_PMD_NR) {
> nr_ptes = 1UL << order;
>
> if (!test_bit(order, &enabled_orders))
> goto next_order;
>
> max_ptes_none = collapse_max_ptes_none(cc, NULL, order);
> -
> nr_occupied_ptes = bitmap_weight_from(cc->mthp_present_ptes, offset,
> offset + nr_ptes);
>
> @@ -1553,7 +1488,7 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> collapsed += nr_ptes;
> fallthrough;
> case SCAN_PTE_MAPPED_HUGEPAGE:
> - continue;
> + goto next_offset;
> /* Cases where lower orders might still succeed */
> case SCAN_ALLOC_HUGE_PAGE_FAIL:
> alloc_failed = true;
> @@ -1581,15 +1516,14 @@ static enum scan_result mthp_collapse(struct mm_struct *mm,
> }
>
> next_order:
> - if ((BIT(order) - 1) & enabled_orders) {
> - const u8 next_order = order - 1;
> - const u16 mid_offset = offset + (nr_ptes / 2);
> -
> - collapse_mthp_stack_push(cc, &stack_size, mid_offset,
> - next_order);
> - collapse_mthp_stack_push(cc, &stack_size, offset,
> - next_order);
> + if (order > KHUGEPAGED_MIN_MTHP_ORDER &&
> + (BIT(order) - 1) & enabled_orders) {
> + order = order - 1;
> + continue;
> }
> +next_offset:
> + offset += nr_ptes;
> + order = min_t(int, __ffs(offset), HPAGE_PMD_ORDER);
> }
> done:
> if (collapsed)
> --
> 2.54.0
>
>
>
> >
> > [...]
> >
> >>>> + bitmap_zero(cc->mthp_bitmap, MAX_PTRS_PER_PTE);
> >>>> memset(cc->node_load, 0, sizeof(cc->node_load));
> >>>> nodes_clear(cc->alloc_nmask);
> >>>> +
> >>>> + enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, tva_flags);
> >>>> +
> >>>> + /*
> >>>> + * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> >>>> + * scan all pages to populate the bitmap for mTHP collapse.
> >>>> + */
> >>>
> >>> You should note here, that we re-verify in mthp_collapse().
> >>>
> >>> But the question is, whether we should relocate the check completely into
> >>> mthp_collapse(), instead of conditionally duplicating it.
> >>>
> >>> What speaks against always populating the bitmap and making the decision in
> >>> mthp_collapse()?
> >>>
> >>> Sure, we might scan a page table a bit longer, but the code gets clearer ... and
> >>> I am not sure if scanning some more page table entries is really that critical here.
> >>
> >> Someone asked me to preserve the legacy behavior (PMD only). Although
> >> rather trivial, if you set max_ptes_none=0 for example, we'd still
> >> have to do 511 iterations for no reason if PMD collapse is the only
> >> enabled order rather than bailing immediately.
> >>
> >> I'm ok with dropping it, but I think its the correct approach (despite
> >> the extra complexity). @Usama Arif brought up this point here
> >> https://lore.kernel.org/all/f8f7bb71-ca31-46ee-a62d-7ddfd83e0ead@xxxxxxxxx/
> >
> > We talk about regressions, but I am not sure if we care about scanning speed
> > within a page table that much?
> >
> > After all, we locked it and already read some entries.
> >
> > Having the same check at two places to optimize for PMD order might right now
> > feel like a good optimization, but likely an irrelevant one in a near future?
> >
> > Anyhow, won't push back, as long as we document why we are special casing things
> > here.
> >
>