Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX

From: JP Kobryn

Date: Mon Jun 01 2026 - 21:49:23 EST

On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:

On 5/27/26 02:10, JP Kobryn wrote:

On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:

On 5/19/26 22:08, JP Kobryn (Meta) wrote:

compact_gap() returns 2 << order, which is used as watermark headroom in
__compaction_suitable() and as a reclaim target in kswapd. The computed
value scales exponentially by order. For order-9 THP allocations this
evaluates to 1024 pages, but the compaction free scanner's working set is
bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free
pages once it matches the migration batch. The current gap over-reserves by
32x.
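
The arithmetic above can be checked with a small userspace sketch. This is
not the kernel code: the helper names gap_uncapped(), gap_capped() and
gap_ratio() are made up for illustration, and COMPACT_CLUSTER_MAX is simply
assumed to be the 32 pages quoted in the changelog.

```c
#include <assert.h>

/* Assumed value: the changelog quotes COMPACT_CLUSTER_MAX as 32 pages. */
#define COMPACT_CLUSTER_MAX 32UL

/* Current definition discussed in the thread: 2 << order. */
static unsigned long gap_uncapped(unsigned int order)
{
	assert(order <= 30);	/* arbitrary bound so the shift cannot wrap here */
	return 2UL << order;
}

/* Proposed definition: the same value, capped at COMPACT_CLUSTER_MAX. */
static unsigned long gap_capped(unsigned int order)
{
	unsigned long gap = gap_uncapped(order);

	return gap < COMPACT_CLUSTER_MAX ? gap : COMPACT_CLUSTER_MAX;
}

/* How many times larger the uncapped gap is than the capped one. */
static unsigned long gap_ratio(unsigned int order)
{
	return gap_uncapped(order) / gap_capped(order);
}
```

With these definitions, gap_uncapped(9) is 1024, gap_capped(9) is 32 and
gap_ratio(9) is 32. For orders up to 4 both helpers return the same value,
so the cap only changes the result from order 5 upward.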

On fragmented production hosts, kswapd will try and reclaim up to the gap,
but it only reaches that threshold 18% of the time, causing reclaim to
continue a majority of the time.

But doesn't that mean there's genuine memory pressure? We're effectively
raising the high watermark by 4 MB, but if processes are continuously
allocating, we'd be reclaiming without the gap as well? Unless the workload
is sized to fit without the gap.

It wasn't actual pressure, but the repetitive order-9 THP failures that were
waking up kswapd. I should make this more clear in the changelog. After
looking into why so much reclaim was occurring though, the compact gap stood
out since it dictates the target amount to reclaim.

But the "amount to reclaim" is still defined as "reach high watermark +
compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
I miss something non-obvious.

Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum, over all zones, of
max(zone high watermark, SWAP_CLUSTER_MAX). The gap is not added to
that reclaim target though. It's used afterward as the threshold for abandoning
high order reclaim:

	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
		sc->order = 0;

balance_pgdat() then returns sc->order and that becomes the kswapd reclaim_order
value, allowing this branch to be taken:

	if (reclaim_order < alloc_order)
		goto kswapd_try_sleep;

Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
kcompactd is woken up for the original alloc_order (order-9).
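
To make the hand-off above easier to follow, here is a simplified userspace
model of the two checks. It is only a sketch: gap_for(),
order_after_shrink() and takes_sleep_path() are hypothetical names, the cap
parameter (0 meaning "no cap") is an assumption used to compare both
variants, and everything else kswapd does is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Gap for a given order; cap == 0 means uncapped (hypothetical helper). */
static unsigned long gap_for(unsigned int order, unsigned long cap)
{
	unsigned long gap;

	assert(order <= 30);	/* arbitrary bound so the shift cannot wrap here */
	gap = 2UL << order;
	return (cap && gap > cap) ? cap : gap;
}

/*
 * Models the check quoted from kswapd_shrink_node(): once enough pages
 * have been reclaimed to cover the gap, high-order reclaim is abandoned
 * and the order drops to 0.
 */
static unsigned int order_after_shrink(unsigned int order,
				       unsigned long nr_reclaimed,
				       unsigned long cap)
{
	if (order && nr_reclaimed >= gap_for(order, cap))
		return 0;
	return order;
}

/*
 * Models the kswapd branch: a reclaim order lower than the allocation
 * order leads to the kswapd_try_sleep path.
 */
static bool takes_sleep_path(unsigned int alloc_order,
			     unsigned int reclaim_order)
{
	return reclaim_order < alloc_order;
}
```

In this model, an order-9 request with 100 pages reclaimed stays at order 9
with the uncapped gap (100 < 1024), but drops to order 0 with a cap of 32,
at which point takes_sleep_path(9, 0) is true.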

So if kswapd did any work, it means the memory was consumed (i.e. there was
some memory pressure) and amount of free memory was below high watermark +
compact_gap()?

Hmm but kswapd can be woken up on a high order failure despite plenty of lower
order availability. That's really the case where compact_gap() matters for
higher orders. Unless by pressure you mean the high order pages were gone?

BTW, are you using mglru here? (probably not)
As that might be different and I'm not so familiar with it.

Using classic LRU.

The over-sized gap also causes 46% of order-9 compaction suitability checks
to fail unnecessarily - the zone has sufficient free pages for the scanner
to operate, but not enough to clear the inflated threshold.

Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
with the scanner's actual capacity. Orders 0-4 are unaffected since their
gap is <= 32.

A/B test on ~100 instagram production hosts (64GB, 60s measurement):

What was the base kernel version?

6.13. Additional benchmarks were done using a recent mm-new build as well,
and they showed similar reductions in reclaim.

If it's a NUMA machine, we recently found an over-reclaim issue there fixed
by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
THP allocations")

Thanks for pointing this out. I tested this on a recent mm-new build that
includes 9c9828d3ead6, and I found the compact_gap() change was still helpful.
My understanding is that 9c9828d3ead6 addresses direct reclaim for THP
allocations, while this patch affects the kswapd reclaim-compaction hand-off
path. The test runs still showed a benefit from capping the gap.

Unpatched (43 hosts)
pgscan_kswapd (mean/host): ~1.6M
reclaim efficiency (steal/scan): 83.8%
compaction success (success/stall): 2.1%
THP success (alloc/alloc+fallback): 4.9%
forced lru_add_drain (mean/host): ~107K

Patched (59 hosts)
pgscan_kswapd (mean/host): ~449K

Did the extra reclaim just disappear because we allow the allocations to use
4MB more memory? Or it shifted to direct reclaim?

Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
What the data shows is that capping the gap allows compaction to take over
sooner and start working to produce large size pages needed for THP. Whereas
in the pre-patch state, trying to reclaim the full 2x THP delays compaction.

So do I understand correctly we might have an issue due to lack of
hysteresis? We require reaching high watermark + compact_gap() to terminate
reclaim, but then compaction can find out we meanwhile dropped below that
(due to concurrent allocations) and it's not suitable again?

On an unpatched kernel in a fragmented environment, compaction_suitable() can
remain false because the effective threshold for costly orders is the low
watermark + the compact gap. Kswapd has to keep reclaiming in high order mode
as a result. By capping the gap at COMPACT_CLUSTER_MAX, compaction becomes suitable
sooner and kswapd reaches the high order reclaim cutoff sooner. So with the patch,
kswapd is able to fall back to order-0 balancing earlier and wake up kcompactd
for the original high order request.

However the suitability checks e.g. compaction_zonelist_suitable() are using
min watermark, so that should provide the difference already.
Actually it's low watermark because of __compaction_suitable() adding an
extra low-min gap for costly orders. But still.

I did just notice compaction_ready() might be too strict. It wants
effectively high wmark plus the gap plus the low-min difference. Is it
perhaps the underlying issue here?
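
The thresholds being compared here can be written down as a rough sketch.
It only encodes the descriptions given in this thread, not the actual
kernel functions: struct zone_wmarks, suitable_threshold() and
ready_threshold() are hypothetical names, and any watermark values plugged
in are made up.

```c
#include <assert.h>

/* Hypothetical per-zone watermarks, in pages; not taken from the thread. */
struct zone_wmarks {
	unsigned long min;
	unsigned long low;
	unsigned long high;
};

/*
 * Suitability threshold as described in the thread: min watermark plus
 * the gap, with an extra (low - min) for costly orders, which makes it
 * effectively low watermark plus the gap.
 */
static unsigned long suitable_threshold(const struct zone_wmarks *w,
					unsigned long gap, int costly)
{
	assert(w->min <= w->low && w->low <= w->high);
	return w->min + gap + (costly ? w->low - w->min : 0);
}

/*
 * compaction_ready() threshold as described in the thread: effectively
 * high watermark plus the gap plus the (low - min) difference.
 */
static unsigned long ready_threshold(const struct zone_wmarks *w,
				     unsigned long gap)
{
	assert(w->min <= w->low && w->low <= w->high);
	return w->high + gap + (w->low - w->min);
}
```

For example, with made-up watermarks of min=1000, low=1250 and high=1500
pages, the costly-order suitability threshold is 2274 pages with a gap of
1024 and 1282 pages with a gap of 32, while the compaction_ready() style
threshold is 2774 and 1782 pages respectively.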

It's a good point. It does seem like that's worth looking into, and I'd be
happy to explore that separately. My thought at the moment though is that
changing compaction_ready() would be a different direction from the original
focus of this patch, which started with the realization that the compaction
scanner working set is bounded by COMPACT_CLUSTER_MAX. Since compact_gap() is
used in multiple reclaim and compaction decisions, including compaction_ready(),
fixing its definition seemed like the right first change if the gap itself is
oversized.

reclaim efficiency (steal/scan): 91.0%
compaction success (success/stall): 28.3%

Is this compaction success per compaction stall or per alloc stall?

That's per compaction.

THP success (alloc/alloc+fallback): 17.2%

Weird that things would improve that much. I would expect the free memory
just to stabilize around the lower gap but then behave similarly. Are we
missing something here?

This patch was tested in isolation, but these hosts also hit the case where
bursty net allocations reserve many pageblocks as high atomic. So as
THP-size pages become eligible, their blocks are reserved before they can be
allocated as THP.

forced lru_add_drain (mean/host): ~64K

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@xxxxxxxxx>
---
include/linux/compaction.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 173d9c07a8952..09aea63b8a89d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,6 +2,8 @@
#ifndef _LINUX_COMPACTION_H
#define _LINUX_COMPACTION_H
+#include <linux/swap.h>
+
/*
* Determines how hard direct compaction should try to succeed.
* Lower value means higher priority, analogically to reclaim priority.
@@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned int order)
* effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
* that the migrate scanner can have isolated on migrate list, and free
* scanner is only invoked when the number of isolated free pages is
- * lower than that. But it's not worth to complicate the formula here
- * as a bigger gap for higher orders than strictly necessary can also
- * improve chances of compaction success.
+ * lower than that.
*/
- return 2UL << order;
+ return min(2UL << order, COMPACT_CLUSTER_MAX);

Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?

I'm thinking I could reframe this patch as reclaim-focused and use
min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
either leaving the other non-reclaim users of this function alone or
using the 2x form you suggest above. i.e. I can split this function
into a separate reclaim_compact_gap() and use the originally proposed cap.
Thoughts?

Do I understand correctly you want to cap the reclaim target by
COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
But wouldn't that mean we'll actually make chances of passing
compaction_suitable() worse?

Good call. I was trying to find some middle ground, but I realize that the
change is better left unified.

Also, I tested a 2x COMPACT_CLUSTER_MAX cap and I saw mixed results - either
similar to this patch or worse, with no improvements over the
COMPACT_CLUSTER_MAX cap.