Re: [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
From: Kairui Song
Date: Wed Jun 24 2026 - 06:27:22 EST
On Fri, Jun 19, 2026 at 12:42 PM Ritesh Harjani (IBM)
<ritesh.list@xxxxxxxxx> wrote:
>
> SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
> allocator code (nofull/frag cluster lists, percpu_swap_cluster's
> si/offset arrays, next array for rotational device). This currently
> expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
> time constant.
>
> However on architecture like PowerPC Book3S64, PMD_ORDER is a runtime
> variable which depends upon which MMU is selected (Radix / Hash), so in
> that case, PMD_ORDER cannot be used to size the static arrays.
>
> This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
> override for such architectures. The memory overhead on enabling this
> override is negligible. Even if we make SWAP_NR_ORDERS runtime alloc,
> default slab padding could cause some memory waste. Also we lose the
> per-cpu cacheline benefits (for percpu_swap_cluster) because it might
> cost an extra cacheline indirection overhead in swap_alloc_fast() for
> fetching si[order]/offset[order]. Note that a fully runtime
> SWAP_NR_ORDERS was considered in previous version but was dropped for
> this reason [1]
>
> [1]: https://lore.kernel.org/linuxppc-dev/pl1zdksc.ritesh.list@xxxxxxxxx/
>
> Suggested-by: YoungJun Park <youngjun.park@xxxxxxx>
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
> ---
> arch/powerpc/include/asm/book3s/64/pgtable.h | 7 +++++++
> include/linux/swap.h | 12 +++++++++++-
> 2 files changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index e67e64ac6e8c..7f22d5d5fbdf 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -204,6 +204,13 @@ extern unsigned long __pmd_frag_size_shift;
> #define MAX_PTRS_PER_PGD (1 << (H_PGD_INDEX_SIZE > RADIX_PGD_INDEX_SIZE ? \
> H_PGD_INDEX_SIZE : RADIX_PGD_INDEX_SIZE))
>
> +/*
> + * Compile-time upper bound on PMD_ORDER across hash and radix MMUs.
> + * Used by THP SWAP code. Check include/linux/swap.h
> + */
> +#define ARCH_MAX_PMD_ORDER ((H_PTE_INDEX_SIZE > RADIX_PTE_INDEX_SIZE) ? \
> + H_PTE_INDEX_SIZE : RADIX_PTE_INDEX_SIZE)
Hi Ritesh
So swap is the only user of this macro? Will there be any other users?
I see that due to the percpu cluster design, it's hard to use a
flexible array here. We will probably get rid of the fixed percpu
cluster design in the future. By then, should we be able to get rid of
this macro?
I'm OK with this approach though. The current design has no negative
effect on other archs, so there's no reason to block it. I'm just
wondering if this can be made simpler in the future :)
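
For anyone following along, here is a rough userspace sketch of the
upper-bound idea as I understand it from the patch description. This is
not the kernel code: every demo_/DEMO_ name is hypothetical, and the two
index sizes are placeholder values picked only so that the two MMU modes
differ. The point is that the arrays are sized by a compile-time maximum
and stay inline in the per-CPU structure, while the order actually
accepted is checked against the runtime PMD order.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder index sizes (assumption); the real values live in arch headers. */
#define DEMO_HASH_PTE_INDEX_SIZE   8
#define DEMO_RADIX_PTE_INDEX_SIZE  9

/* Compile-time upper bound across both MMU modes, same shape as the patch's macro. */
#define DEMO_ARCH_MAX_PMD_ORDER \
	((DEMO_HASH_PTE_INDEX_SIZE > DEMO_RADIX_PTE_INDEX_SIZE) ? \
	 DEMO_HASH_PTE_INDEX_SIZE : DEMO_RADIX_PTE_INDEX_SIZE)

/* One slot per order from 0 up to the maximum PMD order. */
#define DEMO_SWAP_NR_ORDERS (DEMO_ARCH_MAX_PMD_ORDER + 1)

/* Fixed-size per-CPU-style state: the array is inline, so no extra pointer chase. */
struct demo_percpu_cluster {
	unsigned long offset[DEMO_SWAP_NR_ORDERS];
};

/* Runtime PMD order, decided at boot by which MMU is in use. */
static int demo_runtime_pmd_order(int radix)
{
	return radix ? DEMO_RADIX_PTE_INDEX_SIZE : DEMO_HASH_PTE_INDEX_SIZE;
}

/* Returns 0 on success, -1 if order is outside the runtime PMD order. */
static int demo_set_offset(struct demo_percpu_cluster *c, int radix,
			   int order, unsigned long off)
{
	if (order < 0 || order > demo_runtime_pmd_order(radix))
		return -1;
	c->offset[order] = off;
	return 0;
}
```

With these placeholder values, the mode with the smaller PMD order
leaves one array slot unused, which is the kind of small fixed overhead
the commit message calls negligible.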