Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX

From: JP Kobryn

Date: Wed Jun 03 2026 - 03:26:01 EST

On 6/2/26 1:40 AM, Vlastimil Babka (SUSE) wrote:
> On 6/2/26 03:48, JP Kobryn wrote:
>> On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:
>>> On 5/27/26 02:10, JP Kobryn wrote:
>>>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>>>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>>>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>>>>> value scales exponentially by order. For order-9 THP allocations this
>>>>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops
>>>>>> isolating free
>>>>>> pages once it matches the migration batch. The current gap
>>>>>> over-reserves by
>>>>>> 32x.
>>>>>>
>>>>>> On fragmented production hosts, kswapd will try and reclaim up to the
>>>>>> gap,
>>>>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>>>>> continue a majority of the time.
>>>>> But doesn't that mean there's genuine memory pressure? We're effectively
>>>>> raising the high watermark by 4 MB, but if processes are continuously
>>>>> allocating, we'd be reclaiming without the gap as well? Unless the
>>>>> workload
>>>>> is sized to fit without the gap.
>>>> It wasn't actual pressure, but the repetitive order-9 THP failures that were
>>>> waking up kswapd. I should make this more clear in the changelog. After
>>>> looking into why so much reclaim was occurring though, the compact gap stood
>>>> out since it dictates the target amount to reclaim.
>>> But the "amount to reclaim" is still defined as "reach high watermark +
>>> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
>>> I miss something non-obvious.
>> Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high
>> watermark or SWAP_CLUSTER_MAX) for each zone combined. The gap is not
>> added to
>> that reclaim target though. It's used afterward as the threshold for
>> abandoning
>> high order reclaim:
>>
>> if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
>> sc->order = 0;
>>
>> balance_pgdat() then returns sc->order and that becomes the kswapd
>> reclaim_order
>> value, allowing this branch to be taken:
>>
>> if (reclaim_order < alloc_order)
>> goto kswapd_try_sleep;
>>
>> Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
>> kcompactd is woken up for the original alloc_order (order-9).
>
> Oh I see, thanks for explaining. I think it makes sense to target this
> particular part (checking sc->nr_reclaimed) than change compact_gap()
> globally then? It seems we have some mismatch in the various heuristics? IIUC:

I gave this a try and got some interesting results. Based on mm-new as
of earlier today, I ran three variations: original compact_gap (2 <<
order), capped compact_gap (this patch), and capped downgrade gate which
has the original compact_gap (2 << order) but caps within
kswapd_shrink_node():

- if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order)) {
+ if (sc->order && sc->nr_reclaimed >=
+ min(compact_gap(sc->order), SWAP_CLUSTER_MAX)) {
sc->order = 0;
}

The new approach showed improvements in THP allocations.

thp_fault_fallback
original gap: 1217
capped gap (global): 738
capped gap at downgrade gate: 898

More details are below.

>
> - in shrink_node() we have a should_continue_reclaim() call, which will
> return false as soon as compaction is suitable, but before that, we are
> likely to not accumulate enough sc->nr_reclaimed, because sc->nr_to_reclaim
> would be capped by SWAP_CLUSTER_MAX's
>
> - thus we won't pass the sc->nr_reclaimed >= compact_gap check in
> kswapd_shrink_node()
>
> - balance_pgdat() will keep looping because we're not raising priority
> (kswapd_shrink_node() returned a high order) and pgdat_balanced() is false
> (it checks for high-order page availability)

I added some temporary tracepoints to verify paths taken. The average
hits across three 60s runs are shown below.

kswapd_shrink_node downgrade to order-0
original gap: 0
capped gap (global, this patch): 28
capped gap at downgrade gate: 80

So the downgrades are more frequent, but the suggested approach
regressed harder in terms of reclaim.

pgscan_kswapd
original gap: 6328
capped gap (global, this patch): 3773
capped gap at downgrade gate: 7988

pgsteal_kswapd
original gap: 5657
capped gap (global, this patch): 3243
capped gap at downgrade gate: 7101

This is because the suitability checks are still using the inflated gap
causing the split below.
kswapd_shrink_node() gap: 32
__compaction_suitable() gap: 1024

So it seems that capping globally (this patch) is the better option to
avoid the split above which causes unnecessary reclaim.

>
> Maybe only reduce the sc->nr_reclaimed threshold to 2*COMPACT_CLUSTER_MAX then?

This was considered as well. But with the new findings, it would imply
even more reclaim.

>
>>> So if kswapd did any work, it means the memory was consumed (i.e. there was
>>> some memory pressure) and amount of free memory was below high watermark +
>>> compact_gap()?
>> Hmm but kswapd can be woken up on a high order failure despite plenty of
>> lower
>> order availability. That's really the case where compact_gap() matters for
>> higher orders.
>
> Ack, thanks.
>
>> Unless by pressure you mean the high order pages were gone?
>
> No.
>
>>
>>> BTW, are you using mglru here? (probably not)
>>> As that might be different and I'm not so familiar with it.
>> Using classic LRU.
>>
>>>>>> The over-sized gap also causes 46% of
>>>>>> order-9 compaction suitability checks to fail unnecessarily - the
>>>>>> zone has
>>>>>> sufficient free pages for the scanner to operate, but not enough to clear
>>>>>> the inflated threshold.
>>>>>>
>>>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>>>>> gap is <= 32.
>>>>>>
>>>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>>>>> What was the base kernel version?
>>>> 6.13. Additional benchmarks were done using a recent mm-new build as well,
>>>> and they showed similar reductions in reclaim.
>>> If it's a NUMA machine, we recently found an over-reclaim issue there fixed
>>> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
>>> THP allocations")
>> Thanks for pointing this out. I tested this on a recent mm-new built that
>> includes 9c9828d3ead6, and I found the compact_gap() change was still
>> helpful.
>
> OK.
>
>> My understanding is that 9c9828d3ead6 addresses direct reclaim for THP
>> allocations, while this patch affects the kswapd reclaim-compaction hand-off
>> path. The test runs still showed a benefit from capping the gap.
>
> Yep.
>
>>>>>> Unpatched (43 hosts)
>>>>>> pgscan_kswapd (mean/host): ~1.6M
>>>>>> reclaim efficiency (steal/scan): 83.8%
>>>>>> compaction success (success/stall): 2.1%
>>>>>> THP success (alloc/alloc+fallback): 4.9%
>>>>>> forced lru_add_drain (mean/host): ~107K
>>>>>>
>>>>>> Patched (59 hosts)
>>>>>> pgscan_kswapd (mean/host): ~449K
>>>>> Did the extra reclaim just disappear because we allow the allocations
>>>>> to use
>>>>> 4MB more memory? Or it shifted to direct reclaim?
>>>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
>>>> What the data shows is that capping the gap allows compaction to take over
>>>> sooner and start working to produce large size pages needed for THP. Whereas
>>>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.
>>> So do I understand correctly we might have an issue due to lack of
>>> hysteresis? We require reaching high watermark + compact_gap() to terminate
>>> reclaim, but then compaction can find out we meanwhile dropped below that
>>> (due to concurrent allocations) and it's not suitable again?
>> On an unpatched kernel in a fragmented environment,
>> compaction_suitable() can
>> remain false because the effective threshold for costly orders is the low
>> watermark + the compact gap. Kswapd has to keep reclaiming in high order
>> mode
>> as a result.
>
> I think this part might be ok.
>
>> By capping the gap at SWAP_CLUSTER_MAX, compaction becomes
>> suitable
>> sooner and kswapd reaches the high order reclaim cutoff sooner. So with
>
> And the problem is with the cutoff only which is not based on a
> watermark+gap threshold but wants to reclaim at least gap pages regardless
> of how many pages are already free.
>
>> the patch,
>> kswapd is able to fall back to order-0 balancing earlier and wake up
>> kcompactd
>> for the original high order request.
>
> Yeah that's likely the crucial part.
>
>>> However the suitability checks e.g. compaction_zonelist_suitable() are using
>>> min watermark, so that should provide the difference already.
>>> Actually it's low watermark because of __compaction_suitable() adding an
>>> extra low-min gap for costly orders. But still.
>>>
>>> I did just notice compaction_ready() might be too strict. It wants
>>> effectivly high wmark plus the gap plus the low-min difference. Is it
>>> perhaps the underlying issue here?
>> It's a good point. It does seem like that's worth looking into, and I'd be
>> happy to explore that separately.My thought at the moment though is that
>> changing compaction_ready() would be a different direction from the the
>> original
>> focus of this patch, which started with the realization that the compaction
>> scanner working set is bounded by COMPACT_CLUSTER_MAX. Since
>
> Ack.
>
>> compact_gap() is
>> used in multiple reclaim and compaction decisions, including
>> compaction_ready(),
>> fixing its definition seemed like the right first change if the gap
>> itself is
>> oversized.
>
> Still I'd try to address the sc->nr_reclaimed usage first, and see if that's
> enough.

Let me know if I covered that above.

>
>>>>>> reclaim efficiency (steal/scan): 91.0%
>>>>>> compaction success (success/stall): 28.3%
>>>>> Is this compaction success per compaction stall or per alloc stall?
>>>> That's per compaction.
>>>>
>>>>>> THP success (alloc/alloc+fallback): 17.2%
>>>>> Weird that things would improve that much. I would expect the free memory
>>>>> just to stabilize around the lower gap but then behave similarly. Are we
>>>>> missing something here?
>>>> This patch was tested in isolation, but also occurring was the case where
>>>> bursty net allocations reserve many pageblocks as high atomic. So as
>>>> THP-size pages become eligible, their blocks are reserved before being
>>>> allocated as THP.
>>>>
>>>>>> forced lru_add_drain (mean/host): ~64K
>>>>>>
>>>>>> Signed-off-by: JP Kobryn (Meta)<jp.kobryn@xxxxxxxxx>
>>>>>> ---
>>>>>> include/linux/compaction.h | 8 ++++----
>>>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>>>>> index 173d9c07a8952..09aea63b8a89d 100644
>>>>>> --- a/include/linux/compaction.h
>>>>>> +++ b/include/linux/compaction.h
>>>>>> @@ -2,6 +2,8 @@
>>>>>> #ifndef _LINUX_COMPACTION_H
>>>>>> #define _LINUX_COMPACTION_H
>>>>>> +#include <linux/swap.h>
>>>>>> +
>>>>>> /*
>>>>>> * Determines how hard direct compaction should try to succeed.
>>>>>> * Lower value means higher priority, analogically to reclaim priority.
>>>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned
>>>>>> int order)
>>>>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>>>> * that the migrate scanner can have isolated on migrate list, and free
>>>>>> * scanner is only invoked when the number of isolated free pages is
>>>>>> - * lower than that. But it's not worth to complicate the formula here
>>>>>> - * as a bigger gap for higher orders than strictly necessary can also
>>>>>> - * improve chances of compaction success.
>>>>>> + * lower than that.
>>>>>> */
>>>>>> - return 2UL << order;
>>>>>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
>>>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>>>> I'm thinking I could reframe this patch as reclaim-focused and use
>>>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
>>>> either leaving the other non-reclaim users of this function alone or
>>>> using the 2x form you suggest above. i.e. I can split this function
>>>> into a separate reclaim_compact_gap() and use the originally proposed cap.
>>>> Thoughts?
>>> Do I understand correctly you want to cap the reclaim target by
>>> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
>>> But wouldn't that mean we'll actually make changes of passing
>>> compaction_suitable() worse?
>> Good call. I was trying to find some middle ground, but I realize that the
>> change is better left unified.
>
> My question was based on not understanding the underlying issue, and that
> the "reclaim-only target" isn't based on watermark+gap but "reclaim gap
> worth of pages". Now I think sc->nr_reclaimed is indeed the check that
> should be relaxed first.

Hopefully the data helps this time around.