Re: [PATCH] mm/page_alloc: skip high atomic reservation at or below costly order

From: JP Kobryn

Date: Wed May 27 2026 - 01:58:20 EST

On 5/25/26 2:11 AM, Vlastimil Babka (SUSE) wrote:

On 5/19/26 22:28, Johannes Weiner wrote:

On Mon, May 18, 2026 at 06:25:32PM -0700, JP Kobryn (Meta) wrote:

We're seeing a pattern in production where 2MB THP order-9 allocations are
failing due to fragmentation and triggering reclaim on systems with plenty
of free memory. Over time, the success rate of these THP allocations does not
increase at all.

Inspecting zone->vm_stat[NR_FREE_PAGES] via kprobe on compaction_suitable()
indicated the given zone had sufficient free pages for order-9 allocations,
yet they were going unused. Drilling down into the zone and inspecting
/proc/pagetypeinfo revealed why. Order-9 blocks were accumulating in the
zone's HighAtomic bucket (while zero were present in Movable). THP is
unable to draw blocks from HighAtomic since that bucket is not in the
fallback list.

The heuristic for reserving pageblocks in HighAtomic is that any atomic
allocation greater than order-0 will result in the full pageblock being
captured. This means that an order-1 atomic allocation (2 pages) will
over-reserve by 256x, a full 512-page pageblock.
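
To make the over-reservation arithmetic concrete, here is a minimal userspace
sketch. It is not kernel code: the sketch_ names are made up for illustration,
and it assumes order-9 pageblocks (512 base pages), matching the THP size
discussed above.

```c
#include <stdint.h>

/* Assumption: pageblocks are order-9 (512 base pages), as in this thread. */
#define SKETCH_PAGEBLOCK_ORDER 9u

/* Pages covered by a buddy allocation of the given order: 2^order. */
static inline uint64_t sketch_pages_for_order(unsigned int order)
{
	return UINT64_C(1) << order;
}

/*
 * How many times larger a full pageblock is than the request that
 * triggered the reservation. Hypothetical helper, illustration only.
 */
static inline uint64_t sketch_over_reserve_factor(unsigned int order)
{
	return sketch_pages_for_order(SKETCH_PAGEBLOCK_ORDER) /
	       sketch_pages_for_order(order);
}
```

Under these assumptions, sketch_over_reserve_factor(1) evaluates to 256, which
is the 256x figure quoted above.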

Gate the reservation on order. Skip for allocations at or below
PAGE_ALLOC_COSTLY_ORDER. This prevents smaller atomic allocations from
reserving entire pageblocks, and significantly helps when THP is in use on
a fragmented but otherwise healthy system.
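
The gating described above can be modeled as a simple predicate on the
allocation order. This is a userspace sketch of the decision only, not the
patch itself; the sketch_ names are hypothetical, and SKETCH_COSTLY_ORDER
mirrors the mainline value of PAGE_ALLOC_COSTLY_ORDER (3).

```c
#include <stdbool.h>

/* Mirrors PAGE_ALLOC_COSTLY_ORDER (3 in mainline); redefined so this builds alone. */
#define SKETCH_COSTLY_ORDER 3u

/* Heuristic as described above: any atomic request above order-0 reserves. */
static inline bool sketch_reserves_before(unsigned int order)
{
	return order > 0;
}

/* With the gate: requests at or below the costly order no longer reserve. */
static inline bool sketch_reserves_after(unsigned int order)
{
	return order > SKETCH_COSTLY_ORDER;
}
```

In this model, orders 1 through 3 stop triggering a reservation, while order-4
and larger atomic requests remain eligible.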

Testing was performed using an A/B Instagram workload receiving prod
traffic. Each side had ~60 hosts with 64G memory. The patch resulted in
several gains:

Unpatched
HighAtomic pageblocks per host: 309-312 (1% of zone or 620MB),
...all order-9 blocks in HighAtomic
THP success rate: 1-6%
Compaction success rate: 0-2%
pgscan_kswapd (total across ~60 hosts, per minute): ~70.2M
Atomic order-4+ allocations: 0

Patched
HighAtomic pageblocks per host: 1
THP success rate: 44-78%
Compaction success rate: 24-47%
pgscan_kswapd (total across ~60 hosts, per minute): ~29.9M
Atomic order-4+ allocations: 0
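
As a sanity check on the "620MB" figure in the unpatched numbers, the pageblock
count can be converted to memory with a small helper. This is a sketch that
assumes 4 KiB base pages and order-9 pageblocks; the names are made up for
illustration.

```c
#include <stdint.h>

/* Assumptions for this sketch: 4 KiB base pages, order-9 pageblocks. */
#define SKETCH_PAGE_SIZE_KIB   4u
#define SKETCH_PAGEBLOCK_ORDER 9u

/* Convert a number of pageblocks to MiB (one pageblock is 2 MiB here). */
static inline uint64_t sketch_pageblocks_to_mib(uint64_t nr_pageblocks)
{
	uint64_t kib_per_block =
		(uint64_t)SKETCH_PAGE_SIZE_KIB << SKETCH_PAGEBLOCK_ORDER;

	return nr_pageblocks * kib_per_block / 1024;
}
```

With these assumptions, 309-312 pageblocks correspond to 618-624 MiB, and the
single pageblock on the patched side to 2 MiB.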

This is an interesting patch. A couple of thoughts:

1. You disabled the highatomic reserve for this workload and it didn't
seem to matter. Presumably <costly orders don't need the protection.

2. Maxing out the reserves is odd. ALLOC_HIGHATOMIC allocations will
try reserved space first,

Hmm, but if the allocation succeeds before entering slowpath,
ALLOC_NON_BLOCK won't be set.
But reserving another block should mean we already exhausted the reserved ones.
Unreserving is only done when direct reclaim made some progress but failed
to produce a page. But if it works, or kswapd does the job, we won't enter it?

There was just no real pressure to invoke the unreserving. Let me know
if I'm misunderstanding the question.

and I'd expect things that are commonly
highatomic to be short-lived. Why don't we stop with a couple of
claimed highatomic blocks that get continuously recycled?

Maybe it's some big burst of highatomic allocations that leads to the
reservations and then they stay around "forever"?

I should add to the changelog the missing info that high frequency
net allocations are responsible for these high atomic reservations.
Even though the allocations are not necessarily long-lived, the
pageblocks remain high atomic.

If that's the case I think we should be perhaps looking at the unreserving
being done more proactively, rather than limiting things to costly order.

What are your thoughts if we instead look at it as: should we be reserving
full pageblocks for small allocations?

It seems to come down to whether we want the disproportionate protection of full
pageblocks (below costly order) for high atomic allocs vs letting them coalesce
in the buddy path. Is the data not enough to justify the latter?

3. The impact on THP and compaction success rate is pretty
extreme. How can 1% of memory throw such a wrench into the gears?

Maybe if ~all free memory is in the highatomic blocks, compaction can't be
very effective. Or some suitability check somewhere in reclaim+compaction
wrongly assumes the highatomic blocks are usable, so it won't do the work.

I could be missing something, but I spent some time tonight looking into
this and didn't find an issue in the compaction/reclaim suitability path.

__compaction_suitable() calls __zone_watermark_ok(), and that path
subtracts free MIGRATE_HIGHATOMIC pages from usable free memory for
callers without reserve access:

	/*
	 * If the caller does not have rights to reserves below the min
	 * watermark then subtract the free pages reserved for highatomic.
	 */
	if (likely(!(alloc_flags & ALLOC_RESERVES)))
		unusable_free += READ_ONCE(z->nr_free_highatomic);

So free highatomic pages are removed from the usable free count there.
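
A simplified userspace model of that accounting is below. It is a sketch only:
the struct, the function, and the single-bit SKETCH_ALLOC_RESERVES flag are
assumptions for illustration (the real ALLOC_RESERVES is a mask defined in
mm/internal.h), and the other terms the kernel folds into unusable_free are
left out.

```c
/* Hypothetical stand-in for the kernel's ALLOC_RESERVES mask. */
#define SKETCH_ALLOC_RESERVES 0x1u

/* Hypothetical, trimmed-down view of a zone for this sketch. */
struct sketch_zone {
	long nr_free;            /* all free pages in the zone */
	long nr_free_highatomic; /* free pages in MIGRATE_HIGHATOMIC blocks */
};

/* Free pages a caller may count on, given its reserve access. */
static inline long sketch_usable_free(const struct sketch_zone *z,
				      unsigned int alloc_flags)
{
	long unusable_free = 0;

	/* Callers without reserve access cannot use highatomic free pages. */
	if (!(alloc_flags & SKETCH_ALLOC_RESERVES))
		unusable_free += z->nr_free_highatomic;

	return z->nr_free - unusable_free;
}
```

For a zone with 1000 free pages of which 600 are highatomic, a caller without
reserve access sees 400 usable pages in this model, while a caller with reserve
access sees all 1000.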

Also, the suitable-free-block check in __zone_watermark_ok() only treats
MIGRATE_HIGHATOMIC as usable when alloc_flags includes
ALLOC_HIGHATOMIC (or ALLOC_OOM). __compaction_suitable() passes
ALLOC_CMA here (not ALLOC_HIGHATOMIC), so I don't think compaction is
incorrectly treating free highatomic blocks as usable.
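
The free-block check can be sketched the same way. This models a single order's
free lists with a plain array of counts; the enum, flag values, and function
name are hypothetical, and the CMA handling is omitted for brevity.

```c
#include <stdbool.h>

/* Hypothetical migratetype indices for this sketch (CMA and isolate omitted). */
enum sketch_migratetype {
	SKETCH_UNMOVABLE,
	SKETCH_MOVABLE,
	SKETCH_RECLAIMABLE,
	SKETCH_HIGHATOMIC,
	SKETCH_NR_TYPES
};

/* Hypothetical flag bits; the real values live in mm/internal.h. */
#define SKETCH_ALLOC_HIGHATOMIC 0x1u
#define SKETCH_ALLOC_OOM        0x2u
#define SKETCH_ALLOC_CMA        0x4u

/* True if a free block at this order is usable by a caller with these flags. */
static inline bool
sketch_has_usable_block(const unsigned long nr_free[SKETCH_NR_TYPES],
			unsigned int alloc_flags)
{
	/* Ordinary migratetypes are usable by any caller. */
	for (int mt = SKETCH_UNMOVABLE; mt < SKETCH_HIGHATOMIC; mt++)
		if (nr_free[mt] > 0)
			return true;

	/* Highatomic blocks count only for highatomic or OOM callers. */
	if ((alloc_flags & (SKETCH_ALLOC_HIGHATOMIC | SKETCH_ALLOC_OOM)) &&
	    nr_free[SKETCH_HIGHATOMIC] > 0)
		return true;

	return false;
}
```

With all free order-9 blocks in the highatomic list, a caller passing only the
CMA flag gets false here, which matches the behavior described above for
__compaction_suitable().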

The only caveat I noticed is the fragmentation accounting side:
fill_contig_page_info() / fragmentation_index() appear to count
free_area[order].nr_free across migratetypes, so fragmentation scoring
may look better than it really is. But that seems adjacent
to this patch.
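
To illustrate the caveat, here is a userspace sketch of counting suitable free
blocks with and without the highatomic ones. The per-order highatomic count is
a hypothetical field added for this sketch only (the kernel's nr_free is a
single per-order total), and the function is not the kernel's
fill_contig_page_info().

```c
#include <stdbool.h>

/* Hypothetical per-order counters for this sketch. */
struct sketch_free_area {
	unsigned long nr_free;            /* free blocks, every migratetype */
	unsigned long nr_free_highatomic; /* subset in highatomic (sketch only) */
};

/*
 * Count free blocks that could satisfy suitable_order, expressed in units
 * of suitable_order blocks. Optionally exclude the highatomic ones.
 */
static inline unsigned long
sketch_suitable_blocks(const struct sketch_free_area *areas,
		       unsigned int nr_orders, unsigned int suitable_order,
		       bool skip_highatomic)
{
	unsigned long suitable = 0;

	for (unsigned int order = suitable_order; order < nr_orders; order++) {
		unsigned long blocks = areas[order].nr_free;

		if (skip_highatomic)
			blocks -= areas[order].nr_free_highatomic;

		/* A larger free block holds 2^(order - suitable_order) of them. */
		suitable += blocks << (order - suitable_order);
	}

	return suitable;
}
```

If all 310 free order-9 blocks are highatomic, the aggregate count reports 310
suitable blocks while the filtered count reports 0, which is the kind of
optimistic scoring described above.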

I think though that by the time we consider reclaim or compaction we're
dealing with the aftermath. The patch prevents the problem from occurring
up front.