[RFC 0/2] mm: page_alloc: pcp buddy allocator

From: Johannes Weiner

Date: Fri Apr 03 2026 - 15:46:05 EST


Hi,

this is an RFC for making the page allocator scale better with higher
thread counts and larger memory quantities.

In Meta production, we're seeing increasing zone->lock contention that
was traced back to a few different paths. A prominent one is the
userspace allocator, jemalloc. Allocations happen from page faults on
all CPUs running the workload. Frees are cached for reuse, but the
caches are periodically purged back to the kernel from a handful of
purger threads. This breaks affinity between allocations and frees:
Both sides use their own PCPs - one side depletes them, the other one
overfills them. Both sides routinely hit the zone->lock slowpath.

My understanding is that tcmalloc has a similar architecture.

Another contributor to contention is process exits, where large
numbers of pages are freed at once. The current PCP can only reduce
lock time when pages are reused. Reuse is unlikely because it's an
avalanche of free pages on a CPU busy walking page tables. Every time
the PCP overflows, the drain acquires the zone->lock and frees pages
one by one, trying to merge buddies together.

The idea proposed here is this: instead of single pages, make the PCP
grab entire pageblocks, split them outside the zone->lock. That CPU
then takes ownership of the block, and all frees route back to that
PCP instead of the freeing CPU's local one.

This has several benefits:

1. It immediately makes for coarser and fewer allocation transactions
under the zone->lock.

1a. Even if no fully free blocks are available (memory pressure or a
small zone), splitting at the PCP level means the PCP can still
grab chunks larger than the requested order from the zone
freelists, and dole them out on its own time.

2. Pages are freed back to where the allocations happened, increasing
the odds of reuse and reducing the chances of zone->lock slowpaths.

3. The page buddies come back into one place, allowing upfront merging
under the local pcp->lock. This makes for coarser and fewer freeing
transactions under the zone->lock.

The big concern is fragmentation. Movable allocations tend to be a mix
of short-lived anon and long-lived file cache pages. By the time the
PCP needs to drain due to thresholds or pressure, the blocks might not
be fully re-assembled yet. To prevent gobbling up and fragmenting ever
more blocks, partial blocks are remembered on drain and their pages
queued last on the zone freelist. When a PCP refills, it first tries
to recover any such fragment blocks.

On small or pressured machines, the PCP degrades to its previous
behavior. If a whole block doesn't fit within the pcp->high limit, or
no whole block is available, the refill grabs smaller chunks that
aren't marked for ownership. The free side then uses the local PCP as
before.

I still need to run broader benchmarks, but I've been consistently
seeing a 3-4% reduction in %sys time for simple kernel builds on my
32-way, 32G RAM test machine.

A synthetic test on the same machine that allocates on many CPUs and
frees on just a few sees a consistent 1% increase in throughput.

I would expect those numbers to increase with higher concurrency and
larger memory volumes, but verifying that is TBD.

Sending an RFC to get an early gauge on direction.

Based on 0257f64bdac7fdca30fa3cae0df8b9ecbec7733a.

include/linux/mmzone.h | 38 ++-
include/linux/page-flags.h | 9 +
mm/debug.c | 1 +
mm/internal.h | 17 +
mm/mm_init.c | 25 +-
mm/page_alloc.c | 784 +++++++++++++++++++++++++++++++------------
mm/sparse.c | 3 +-
7 files changed, 622 insertions(+), 255 deletions(-)