Re: [PATCH] mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX
From: Vlastimil Babka (SUSE)
Date: Wed Jun 03 2026 - 04:33:50 EST
On 6/3/26 09:15, JP Kobryn wrote:
> On 6/2/26 1:40 AM, Vlastimil Babka (SUSE) wrote:
>> On 6/2/26 03:48, JP Kobryn wrote:
>>> On 5/28/26 1:51 AM, Vlastimil Babka (SUSE) wrote:
>>>> On 5/27/26 02:10, JP Kobryn wrote:
>>>>> On 5/25/26 3:02 AM, Vlastimil Babka (SUSE) wrote:
>>>>>> On 5/19/26 22:08, JP Kobryn (Meta) wrote:
>>>>>>> compact_gap() returns 2 << order, which is used as watermark headroom in
>>>>>>> __compaction_suitable() and as a reclaim target in kswapd. The computed
>>>>>>> value scales exponentially by order. For order-9 THP allocations this
>>>>>>> evaluates to 1024 pages, but the compaction free scanner's working set is
>>>>>>> bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating
>>>>>>> free pages once it matches the migration batch. The current gap
>>>>>>> over-reserves by 32x.
>>>>>>>
>>>>>>> On fragmented production hosts, kswapd will try and reclaim up to the
>>>>>>> gap, but it only reaches that threshold 18% of the time, causing reclaim
>>>>>>> to continue a majority of the time.
>>>>>> But doesn't that mean there's genuine memory pressure? We're effectively
>>>>>> raising the high watermark by 4 MB, but if processes are continuously
>>>>>> allocating, we'd be reclaiming without the gap as well? Unless the
>>>>>> workload is sized to fit without the gap.
>>>>> It wasn't actual pressure, but the repetitive order-9 THP failures that were
>>>>> waking up kswapd. I should make this more clear in the changelog. After
>>>>> looking into why so much reclaim was occurring though, the compact gap stood
>>>>> out since it dictates the target amount to reclaim.
>>>> But the "amount to reclaim" is still defined as "reach high watermark +
>>>> compact_gap()" and not "reclaim at least compact_gap() pages" right? Or did
>>>> I miss something non-obvious.
>>> Within kswapd_shrink_node(), sc->nr_to_reclaim is the sum of max(zone high
>>> watermark, SWAP_CLUSTER_MAX) over all zones. The gap is not added to
>>> that reclaim target though. It's used afterward as the threshold for
>>> abandoning high order reclaim:
>>>
>>> 	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
>>> 		sc->order = 0;
>>>
>>> balance_pgdat() then returns sc->order and that becomes the kswapd
>>> reclaim_order value, allowing this branch to be taken:
>>>
>>> 	if (reclaim_order < alloc_order)
>>> 		goto kswapd_try_sleep;
>>>
>>> Then in prepare_kswapd_sleep(), if pgdat_balanced() succeeds (at order-0),
>>> kcompactd is woken up for the original alloc_order (order-9).
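
The two checks quoted above can be modelled in userspace roughly as follows.
This is only a sketch, not kernel code: order_after_shrink() and
takes_sleep_path() are made-up helper names, and compact_gap() is taken as
2 << order from the changelog.

```c
/* Uncapped gap as described in the changelog: 2 << order pages. */
static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/*
 * Hypothetical helper mirroring the gate in kswapd_shrink_node():
 * high-order reclaim is abandoned (order drops to 0) only once the
 * reclaimed page count reaches the gap.
 */
static unsigned int order_after_shrink(unsigned int order,
				       unsigned long nr_reclaimed)
{
	if (order && nr_reclaimed >= compact_gap(order))
		return 0;
	return order;
}

/*
 * Hypothetical helper mirroring the kswapd branch: the early jump to
 * kswapd_try_sleep is taken only if the order was downgraded.
 */
static int takes_sleep_path(unsigned int reclaim_order,
			    unsigned int alloc_order)
{
	return reclaim_order < alloc_order;
}
```

With the uncapped gap, order_after_shrink(9, 32) stays at 9 because 32 < 1024,
so takes_sleep_path(9, 9) is false. Only once 1024 pages have been reclaimed
does the order drop to 0 and the kswapd_try_sleep branch become reachable.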
>>
>> Oh I see, thanks for explaining. I think it makes sense to target this
>> particular part (checking sc->nr_reclaimed) rather than change compact_gap()
>> globally then? It seems we have some mismatch in the various heuristics? IIUC:
>
> I gave this a try and got some interesting results. Based on mm-new as
> of earlier today, I ran three variations: original compact_gap (2 <<
> order), capped compact_gap (this patch), and capped downgrade gate which
> has the original compact_gap (2 << order) but caps within
> kswapd_shrink_node():
>
> -	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order)) {
> +	if (sc->order && sc->nr_reclaimed >=
> +	    min(compact_gap(sc->order), SWAP_CLUSTER_MAX)) {
>  		sc->order = 0;
>  	}
>
> The new approach (capping at the downgrade gate) reduced THP fault
> fallbacks relative to the original gap, though not as much as the global
> cap.
>
> thp_fault_fallback
> original gap: 1217
> capped gap (global): 738
> capped gap at downgrade gate: 898
>
> More details are below.
>
>>
>> - in shrink_node() we have a should_continue_reclaim() call, which will
>> return false as soon as compaction is suitable, but before that, we are
>> likely to not accumulate enough sc->nr_reclaimed, because sc->nr_to_reclaim
>> would be capped by SWAP_CLUSTER_MAX's
>>
>> - thus we won't pass the sc->nr_reclaimed >= compact_gap check in
>> kswapd_shrink_node()
>>
>> - balance_pgdat() will keep looping because we're not raising priority
>> (kswapd_shrink_node() returned a high order) and pgdat_balanced() is false
>> (it checks for high-order page availability)
>
> I added some temporary tracepoints to verify paths taken. The average
> hits across three 60s runs are shown below.
>
> kswapd_shrink_node downgrade to order-0
> original gap: 0
> capped gap (global, this patch): 28
> capped gap at downgrade gate: 80
>
> So the downgrades are more frequent, but the suggested approach
> regressed harder in terms of reclaim.
>
> pgscan_kswapd
> original gap: 6328
> capped gap (global, this patch): 3773
> capped gap at downgrade gate: 7988
>
> pgsteal_kswapd
> original gap: 5657
> capped gap (global, this patch): 3243
> capped gap at downgrade gate: 7101
>
> This is because the suitability checks are still using the inflated gap,
> causing the split below:
>
>   kswapd_shrink_node() gap: 32
>   __compaction_suitable() gap: 1024
>
> So it seems that capping globally (this patch) is the better option to
> avoid the split above which causes unnecessary reclaim.
OK, that's convincing, and thanks a lot for doing that. Could you summarize
this in the changelog as well? Thanks!
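
For the changelog summary, the three variants could be modelled roughly like
this. It is a userspace sketch under stated assumptions, not the patch itself:
compact_gap_capped() and downgrade_gate_threshold() are hypothetical names, the
cap is assumed to be a plain min() against COMPACT_CLUSTER_MAX, and both
constants are redefined locally as 32 pages to match the values quoted in the
thread.

```c
/* Local stand-ins for the kernel constants (32 pages, as quoted above). */
#define SWAP_CLUSTER_MAX	32UL
#define COMPACT_CLUSTER_MAX	SWAP_CLUSTER_MAX

/* Original behaviour: the gap doubles with every order. */
static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/* Hypothetical name for the globally capped variant (assumed min()). */
static unsigned long compact_gap_capped(unsigned int order)
{
	unsigned long gap = 2UL << order;

	return gap < COMPACT_CLUSTER_MAX ? gap : COMPACT_CLUSTER_MAX;
}

/*
 * Hypothetical name for the "capped at downgrade gate" variant: only the
 * kswapd_shrink_node() threshold is capped, while every other caller of
 * compact_gap() keeps seeing the uncapped value.
 */
static unsigned long downgrade_gate_threshold(unsigned int order)
{
	unsigned long gap = compact_gap(order);

	return gap < SWAP_CLUSTER_MAX ? gap : SWAP_CLUSTER_MAX;
}
```

For order-9, downgrade_gate_threshold(9) is 32 while compact_gap(9), which the
suitability check would still use in that variant, stays at 1024. That is the
32 vs 1024 split. With compact_gap_capped(9) both sides see 32.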