Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX

From: Vlastimil Babka (SUSE)

Date: Tue Jun 02 2026 - 04:47:21 EST

On 6/2/26 03:48, JP Kobryn wrote:
> On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:
>> On 5/27/26 02:10, JP Kobryn wrote:
>>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>>>> value scales exponentially by order. For order-9 THP allocations this
>>>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops
>>>>> isolating free
>>>>> pages once it matches the migration batch. The current gap
>>>>> over-reserves by
>>>>> 32x.
>>>>>
>>>>> On fragmented production hosts, kswapd will try and reclaim up to the
>>>>> gap,
>>>>> but it only reaches that threshold 18% of the time, causing reclaim to
>>>>> continue a majority of the time.
>>>> But doesn't that mean there's genuine memory pressure? We're effectively
>>>> raising the high watermark by 4 MB, but if processes are continuously
>>>> allocating, we'd be reclaiming without the gap as well? Unless the
>>>> workload
>>>> is sized to fit without the gap.
>>> It wasn't actual pressure, but the repetitive order-9 THP failures that were
>>> waking up kswapd. I should make this more clear in the changelog. After
>>> looking into why so much reclaim was occurring though, the compact gap stood
>>> out since it dictates the target amount to reclaim.
>> But the "amount to reclaim" is still defined as "reach high watermark +
>> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
>> I miss something non-obvious.
> Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high
> watermark or SWAP_CLUSTER_MAX) for each zone combined. The gap is not
> added to
> that reclaim target though. It's used afterward as the threshold for
> abandoning
> high order reclaim:
>
> if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
> sc->order = 0;
>
> balance_pgdat() then returns sc->order and that becomes the kswapd
> reclaim_order
> value, allowing this branch to be taken:
>
> if (reclaim_order < alloc_order)
> goto kswapd_try_sleep;
>
> Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
> kcompactd is woken up for the original alloc_order (order-9).

Oh I see, thanks for explaining. I think it makes sense to target this
particular part (checking sc->nr_reclaimed) than change compact_gap()
globally then? It seems we have some mismatch in the various heuristics? IIUC:

- in shrink_node() we have a should_continue_reclaim() call, which will
return false as soon as compaction is suitable, but before that, we are
likely to not accumulate enough sc->nr_reclaimed, because sc->nr_to_reclaim
would be capped by SWAP_CLUSTER_MAX's

- thus we won't pass the sc->nr_reclaimed >= compact_gap check in
kswapd_shrink_node()

- balance_pgdat() will keep looping because we're not raising priority
(kswapd_shrink_node() returned a high order) and pgdat_balanced() is false
(it checks for high-order page availability)

Maybe only reduce the sc->nr_reclaimed threshold to 2*COMPACT_CLUSTER_MAX then?

>> So if kswapd did any work, it means the memory was consumed (i.e. there was
>> some memory pressure) and amount of free memory was below high watermark +
>> compact_gap()?
> Hmm but kswapd can be woken up on a high order failure despite plenty of
> lower
> order availability. That's really the case where compact_gap() matters for
> higher orders.

Ack, thanks.

> Unless by pressure you mean the high order pages were gone?

No.

>
>> BTW, are you using mglru here? (probably not)
>> As that might be different and I'm not so familiar with it.
> Using classic LRU.
>
>>>>> The over-sized gap also causes 46% of
>>>>> order-9 compaction suitability checks to fail unnecessarily - the
>>>>> zone has
>>>>> sufficient free pages for the scanner to operate, but not enough to clear
>>>>> the inflated threshold.
>>>>>
>>>>> Cap compact_gap() at COMPACT_CLUSTER_MAX to align the watermark headroom
>>>>> with the scanner's actual capacity. Orders 0-4 are unaffected since their
>>>>> gap is <= 32.
>>>>>
>>>>> A/B test on ~100 instagram production hosts (64GB, 60s measurement):
>>>> What was the base kernel version?
>>> 6.13. Additional benchmarks were done using a recent mm-new build as well,
>>> and they showed similar reductions in reclaim.
>> If it's a NUMA machine, we recently found an over-reclaim issue there fixed
>> by 9c9828d3ead6 ("mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE
>> THP allocations")
> Thanks for pointing this out. I tested this on a recent mm-new built that
> includes 9c9828d3ead6, and I found the compact_gap() change was still
> helpful.

OK.

> My understanding is that 9c9828d3ead6 addresses direct reclaim for THP
> allocations, while this patch affects the kswapd reclaim-compaction hand-off
> path. The test runs still showed a benefit from capping the gap.

Yep.

>>>>> Unpatched (43 hosts)
>>>>> pgscan_kswapd (mean/host): ~1.6M
>>>>> reclaim efficiency (steal/scan): 83.8%
>>>>> compaction success (success/stall): 2.1%
>>>>> THP success (alloc/alloc+fallback): 4.9%
>>>>> forced lru_add_drain (mean/host): ~107K
>>>>>
>>>>> Patched (59 hosts)
>>>>> pgscan_kswapd (mean/host): ~449K
>>>> Did the extra reclaim just disappear because we allow the allocations
>>>> to use
>>>> 4MB more memory? Or it shifted to direct reclaim?
>>> Specifically in the order-9 case, the reclaim target goes from 1024 to 32.
>>> What the data shows is that capping the gap allows compaction to take over
>>> sooner and start working to produce large size pages needed for THP. Whereas
>>> in the pre-patch state, trying to reclaim the full 2x THP delays compaction.
>> So do I understand correctly we might have an issue due to lack of
>> hysteresis? We require reaching high watermark + compact_gap() to terminate
>> reclaim, but then compaction can find out we meanwhile dropped below that
>> (due to concurrent allocations) and it's not suitable again?
> On an unpatched kernel in a fragmented environment,
> compaction_suitable() can
> remain false because the effective threshold for costly orders is the low
> watermark + the compact gap. Kswapd has to keep reclaiming in high order
> mode
> as a result.

I think this part might be ok.

> By capping the gap at SWAP_CLUSTER_MAX, compaction becomes
> suitable
> sooner and kswapd reaches the high order reclaim cutoff sooner. So with

And the problem is with the cutoff only which is not based on a
watermark+gap threshold but wants to reclaim at least gap pages regardless
of how many pages are already free.

> the patch,
> kswapd is able to fall back to order-0 balancing earlier and wake up
> kcompactd
> for the original high order request.

Yeah that's likely the crucial part.

>> However the suitability checks e.g. compaction_zonelist_suitable() are using
>> min watermark, so that should provide the difference already.
>> Actually it's low watermark because of __compaction_suitable() adding an
>> extra low-min gap for costly orders. But still.
>>
>> I did just notice compaction_ready() might be too strict. It wants
>> effectivly high wmark plus the gap plus the low-min difference. Is it
>> perhaps the underlying issue here?
> It's a good point. It does seem like that's worth looking into, and I'd be
> happy to explore that separately.My thought at the moment though is that
> changing compaction_ready() would be a different direction from the the
> original
> focus of this patch, which started with the realization that the compaction
> scanner working set is bounded by COMPACT_CLUSTER_MAX. Since

Ack.

> compact_gap() is
> used in multiple reclaim and compaction decisions, including
> compaction_ready(),
> fixing its definition seemed like the right first change if the gap
> itself is
> oversized.

Still I'd try to address the sc->nr_reclaimed usage first, and see if that's
enough.

>>>>> reclaim efficiency (steal/scan): 91.0%
>>>>> compaction success (success/stall): 28.3%
>>>> Is this compaction success per compaction stall or per alloc stall?
>>> That's per compaction.
>>>
>>>>> THP success (alloc/alloc+fallback): 17.2%
>>>> Weird that things would improve that much. I would expect the free memory
>>>> just to stabilize around the lower gap but then behave similarly. Are we
>>>> missing something here?
>>> This patch was tested in isolation, but also occurring was the case where
>>> bursty net allocations reserve many pageblocks as high atomic. So as
>>> THP-size pages become eligible, their blocks are reserved before being
>>> allocated as THP.
>>>
>>>>> forced lru_add_drain (mean/host): ~64K
>>>>>
>>>>> Signed-off-by: JP Kobryn (Meta)<jp.kobryn@xxxxxxxxx>
>>>>> ---
>>>>> include/linux/compaction.h | 8 ++++----
>>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>>>>> index 173d9c07a8952..09aea63b8a89d 100644
>>>>> --- a/include/linux/compaction.h
>>>>> +++ b/include/linux/compaction.h
>>>>> @@ -2,6 +2,8 @@
>>>>> #ifndef _LINUX_COMPACTION_H
>>>>> #define _LINUX_COMPACTION_H
>>>>> +#include <linux/swap.h>
>>>>> +
>>>>> /*
>>>>> * Determines how hard direct compaction should try to succeed.
>>>>> * Lower value means higher priority, analogically to reclaim priority.
>>>>> @@ -73,11 +75,9 @@ static inline unsigned long compact_gap(unsigned
>>>>> int order)
>>>>> * effectively limited by COMPACT_CLUSTER_MAX, as that's the maximum
>>>>> * that the migrate scanner can have isolated on migrate list, and free
>>>>> * scanner is only invoked when the number of isolated free pages is
>>>>> - * lower than that. But it's not worth to complicate the formula here
>>>>> - * as a bigger gap for higher orders than strictly necessary can also
>>>>> - * improve chances of compaction success.
>>>>> + * lower than that.
>>>>> */
>>>>> - return 2UL << order;
>>>>> + return min(2UL << order, COMPACT_CLUSTER_MAX);
>>>> Shouldn't it at least be 2x COMPACT_CLUSTER_MAX?
>>> I'm thinking I could reframe this patch as reclaim-focused and use
>>> min(2UL << order, COMPACT_CLUSTER_MAX) as a reclaim-only target, while
>>> either leaving the other non-reclaim users of this function alone or
>>> using the 2x form you suggest above. i.e. I can split this function
>>> into a separate reclaim_compact_gap() and use the originally proposed cap.
>>> Thoughts?
>> Do I understand correctly you want to cap the reclaim target by
>> COMPACT_CLUSTER_MAX but leave e.g. the compaction_suitable() usage as it is?
>> But wouldn't that mean we'll actually make changes of passing
>> compaction_suitable() worse?
> Good call. I was trying to find some middle ground, but I realize that the
> change is better left unified.

My question was based on not understanding the underlying issue, and that
the "reclaim-only target" isn't based on watermark+gap but "reclaim gap
worth of pages". Now I think sc->nr_reclaimed is indeed the check that
should be relaxed first.

> Also, I tested a 2x COMPACT_CLUSTER_MAX cap and I saw mixed results - either
> similar to this patch or worse, with no improvements over the
> COMPACT_CLUSTER_MAX cap.

Right. Thanks!