Re: [RFC 0/2] mm: page_alloc: pcp buddy allocator

From: Zi Yan

Date: Fri Apr 03 2026 - 22:28:35 EST


On 3 Apr 2026, at 15:40, Johannes Weiner wrote:

> Hi,
>
> this is an RFC for making the page allocator scale better with higher
> thread counts and larger memory quantities.
>
> In Meta production, we're seeing increasing zone->lock contention that
> was traced back to a few different paths. A prominent one is the
> userspace allocator, jemalloc. Allocations happen from page faults on
> all CPUs running the workload. Frees are cached for reuse, but the
> caches are periodically purged back to the kernel from a handful of
> purger threads. This breaks affinity between allocations and frees:
> Both sides use their own PCPs - one side depletes them, the other one
> overfills them. Both sides routinely hit the zone->lock slowpath.
>
> My understanding is that tcmalloc has a similar architecture.
>
> Another contributor to contention is process exits, where large
> numbers of pages are freed at once. The current PCP can only reduce
> lock time when pages are reused. Reuse is unlikely because it's an
> avalanche of free pages on a CPU busy walking page tables. Every time
> the PCP overflows, the drain acquires the zone->lock and frees pages
> one by one, trying to merge buddies together.

IIUC, zone->lock hold time is mostly spent on free page merging.
Have you tried letting the PCP do the free page merging before
taking zone->lock and returning free pages to buddy? That is a much
smaller change than what you propose. This method would not work
when physically contiguous free pages are freed on separate CPUs,
so that no single PCP sees both buddies to merge. But maybe that
is rare?

>
> The idea proposed here is this: instead of single pages, make the PCP
> grab entire pageblocks, split them outside the zone->lock. That CPU
> then takes ownership of the block, and all frees route back to that
> PCP instead of the freeing CPU's local one.

This is basically a distributed buddy allocator, right? Instead of
relying on a single zone->lock, PCP locks are used. The worst case
it can face is that physically contiguous free pages are allocated
across all CPUs, so that all CPUs end up competing for a single PCP
lock. It seems that you have not hit this, so I wonder if what I
proposed above might work as a simpler approach. Let me know if I
missed anything.

I also wonder how this distributed buddy allocator would work when
someone wants to allocate more than a pageblock of free pages, like
alloc_contig_range() does. Multiple PCP locks would need to be taken
one by one. Maybe that is still better than taking and dropping
zone->lock repeatedly. Have you benchmarked alloc_contig_range(),
e.g. hugetlb allocation?

>
> This has several benefits:
>
> 1. It right away means coarser/fewer allocation transactions under
> the zone->lock.
>
> 1a. Even if no full free blocks are available (memory pressure or
> small zone), splitting at the PCP level means the PCP can still
> grab chunks larger than the requested order from the zone
> freelists, and dole them out on its own time.
>
> 2. The pages free back to where the allocations happen, increasing the
> odds of reuse and reducing the chances of zone->lock slowpaths.
>
> 3. The page buddies come back into one place, allowing upfront merging
> under the local pcp->lock. This makes coarser/fewer freeing
> transactions under the zone->lock.

I wonder if we could go more radical and move the buddy allocator
out from under zone->lock completely, onto PCP locks. If one PCP
runs out of free pages, it can steal another PCP's whole pageblock.
I should probably do some literature investigation on this; some
research must have been done on it.

>
> The big concern is fragmentation. Movable allocations tend to be a mix
> of short-lived anon and long-lived file cache pages. By the time the
> PCP needs to drain due to thresholds or pressure, the blocks might not
> be fully re-assembled yet. To prevent gobbling up and fragmenting ever
> more blocks, partial blocks are remembered on drain and their pages
> queued last on the zone freelist. When a PCP refills, it first tries
> to recover any such fragment blocks.
>
> On small or pressured machines, the PCP degrades to its previous
> behavior. If a whole block doesn't fit the pcp->high limit, or a whole
> block isn't available, the refill grabs smaller chunks that aren't
> marked for ownership. The free side will use the local PCP as before.
>
> I still need to run broader benchmarks, but I've been consistently
> seeing a 3-4% reduction in %sys time for simple kernel builds on my
> 32-way, 32G RAM test machine.
>
> A synthetic test on the same machine that allocates on many CPUs and
> frees on just a few sees a consistent 1% increase in throughput.
>
> I would expect those numbers to increase with higher concurrency and
> larger memory volumes, but verifying that is TBD.
>
> Sending an RFC to get an early gauge on direction.

Thank you for sending this out. :)

>
> Based on 0257f64bdac7fdca30fa3cae0df8b9ecbec7733a.
>
> include/linux/mmzone.h | 38 ++-
> include/linux/page-flags.h | 9 +
> mm/debug.c | 1 +
> mm/internal.h | 17 +
> mm/mm_init.c | 25 +-
> mm/page_alloc.c | 784 +++++++++++++++++++++++++++++++------------
> mm/sparse.c | 3 +-
> 7 files changed, 622 insertions(+), 255 deletions(-)


--
Best Regards,
Yan, Zi