Re: [PATCH v4 2/3] mm, swap: allow archs to override SWAP_NR_ORDERS via ARCH_MAX_PMD_ORDER
From: Barry Song
Date: Tue Jun 23 2026 - 04:43:14 EST
On Tue, Jun 23, 2026 at 3:05 PM Ritesh Harjani <ritesh.list@xxxxxxxxx> wrote:
>
> Barry Song <baohua@xxxxxxxxxx> writes:
>
> > On Fri, Jun 19, 2026 at 12:41 PM Ritesh Harjani (IBM)
> > <ritesh.list@xxxxxxxxx> wrote:
> >>
> >> SWAP_NR_ORDERS sizes a few small bounded arrays inside THP swap
> >> allocator code (nofull/frag cluster lists, percpu_swap_cluster's
> >> si/offset arrays, next array for rotational device). This currently
> >> expands to PMD_ORDER+1, which only works when PMD_ORDER is a compile
> >> time constant.
> >>
> >> However on architecture like PowerPC Book3S64, PMD_ORDER is a runtime
> >> variable which depends upon which MMU is selected (Radix / Hash), so in
> >> that case, PMD_ORDER cannot be used to size the static arrays.
> >>
> >> This patch provides an optional ARCH_MAX_PMD_ORDER (upper-bound)
> >> override for such architectures. The memory overhead on enabling this
> >> override is negligible. Even if we make SWAP_NR_ORDERS runtime alloc,
> >> default slab padding could cause some memory waste. Also we lose the
> >> per-cpu cacheline benefits (for percpu_swap_cluster) because it might
> >> cost an extra cacheline indirection overhead in swap_alloc_fast() for
> >> fetching si[order]/offset[order]. Note that a fully runtime
> >> SWAP_NR_ORDERS was considered in previous version but was dropped for
> >> this reason [1]
> >
> > Do we know the maximum PMD size?
>
> ARCH_MAX_PMD_ORDER will be 8 on PowerPC book3s64 with 64K pagesize.
> PowerPC Hash MMU with 64K default pagesize supports PMD size of 16MB.
>
> > On arm64 with a 64 KB base page,
> > a PMD can be as large as 512 MB:
> > https://docs.kernel.org/arch/arm64/hugetlbpage.html
> >
> > One concern we have is that performing I/O on such a large folio could
> > incur significant latency before reclaiming any memory. For this
> > reason, on arm64 we initially enabled THP_SWAPOUT only for 4 KB base
> > pages:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0637c505f
> >
>
> That's not the case on PowerPC. Max PMD size for Hash will be 16MB.

Yep. A 16 MB folio might be fine, although I'm not sure whether
splitting a 16 MB folio into eight 2 MB folios would help much.

For 512 MB PMD-sized pages on arm64, one possible approach might be to
split them into 256 × 2 MB folios rather than all the way down to 4 KB
pages. That could provide a better balance between I/O latency and swap
performance.

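
To sanity-check the sizes in this thread, here is a rough userspace sketch of
the arithmetic. It is not kernel code: the helper names are made up for
illustration, and the arm64 PMD order of 13 is simply derived from
512 MB / 64 KB rather than taken from any header. Only ARCH_MAX_PMD_ORDER and
SWAP_NR_ORDERS come from the patch itself.

```c
#include <assert.h>

/* 2 MB split size, as discussed above. */
#define CHUNK_2M (2ULL << 20)

/* Bytes covered by a folio of 'order' with a base page of 2^page_shift bytes. */
static unsigned long long folio_bytes(unsigned int page_shift, unsigned int order)
{
	return 1ULL << (page_shift + order);
}

/* Number of chunk_bytes-sized folios obtained by splitting such a folio. */
static unsigned long long nr_chunks(unsigned int page_shift, unsigned int order,
				    unsigned long long chunk_bytes)
{
	return folio_bytes(page_shift, order) / chunk_bytes;
}

/* Array size for a given upper bound, mirroring SWAP_NR_ORDERS = max order + 1. */
static unsigned int swap_nr_orders(unsigned int max_pmd_order)
{
	return max_pmd_order + 1;
}

/* Hypothetical self-check of the numbers quoted in the thread. */
static void check_thread_numbers(void)
{
	/* PowerPC Book3S64, 64K pages (shift 16), ARCH_MAX_PMD_ORDER = 8. */
	assert(folio_bytes(16, 8) == (16ULL << 20));        /* 16 MB PMD */
	assert(nr_chunks(16, 8, CHUNK_2M) == 8);            /* eight 2 MB folios */
	assert(swap_nr_orders(8) == 9);                     /* orders 0..8 */

	/* arm64, 64K pages: 512 MB PMD, i.e. order 13 (derived, see above). */
	assert(folio_bytes(16, 13) == (512ULL << 20));
	assert(nr_chunks(16, 13, CHUNK_2M) == 256);         /* 256 x 2 MB folios */
}
```

Calling check_thread_numbers() from any main() should pass silently if the
figures above are consistent.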
> Also we still need this patch since we can at runtime choose Hash or
> Radix MMU. So, the main problem this patch is trying to solve on PowerPC
> Book3s64 is enabling this feature w/o impacting any other architecture.
> W/O this patch series, we can't enable it, since it gives build errors.

I see. If possible, please mention in the changelog that the maximum
PMD size on your platform is 16 MB. In that case, the I/O latency
concerns I raised may not really apply.

With that, please feel free to add:

Reviewed-by: Barry Song <baohua@xxxxxxxxxx>