Re: [RFC PATCH v2] mm: Improve pgdat_balanced() to avoid over-reclamation for higher-order allocation
From: Barry Song
Date: Wed Apr 22 2026 - 06:56:43 EST
On Wed, Apr 22, 2026 at 2:59 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
>
>
>
> On 4/22/26 10:18 AM, Barry Song (Xiaomi) wrote:
> > We may encounter cases where the system still has plenty of free
> > memory, but cannot satisfy higher-order allocations. On phones, we
> > have observed that bursty network transfers can cause devices to
> > heat up. Baolin and Kairui have seen similar behavior on servers.
> >
> > Currently, kswapd behaves as follows: when a higher-order allocation
> > is issued with __GFP_KSWAPD_RECLAIM, pgdat_balanced() returns false
> > because __zone_watermark_ok() fails if no suitable higher-order
> > pages exist, even when free memory is well above the high watermark.
> > As a result, kswapd_shrink_node() sets an excessively large
> > sc->nr_to_reclaim and attempts aggressive reclamation:
> >
> > for_each_managed_zone_pgdat(zone, pgdat, z, sc->reclaim_idx) {
> > sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
> > }
> >
> > We have an opportunity to re-evaluate the balance by resetting
> > sc->order to 0 after shrink_node() with the following code
> > in kswapd_shrink_node():
> > /*
> > * Fragmentation may mean that the system cannot be rebalanced for
> > * high-order allocations. If twice the allocation size has been
> > * reclaimed then recheck watermarks only at order-0 to prevent
> > * excessive reclaim.
> > */
> > if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
> > sc->order = 0;
> >
> > By that point, however, we have already scanned and over-reclaimed
> > far more than compact_gap(sc->order). If higher-order allocations
> > continue, we may see persistently high kswapd CPU utilization
> > coexisting with plenty of free memory in the system.
> >
> > We may want to evaluate the situation earlier, in pgdat_balanced()
> > itself. If there is plenty of free memory, we could avoid triggering
> > reclamation with an excessively large sc->nr_to_reclaim and instead
> > prefer compaction.
> >
> > Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Cc: David Hildenbrand <david@xxxxxxxxxx>
> > Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> > Cc: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
> > Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> > Cc: Lorenzo Stoakes <ljs@xxxxxxxxxx>
> > Cc: Kairui Song <kasong@xxxxxxxxxxx>
> > Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> > Cc: Yuanchu Xie <yuanchu@xxxxxxxxxx>
> > Cc: Wei Xu <weixugc@xxxxxxxxxx>
> > Co-developed-by: Wang Lian <wanglian@xxxxxxxxxx>
> > Co-developed-by: Kunwu Chan <chentao@xxxxxxxxxx>
> > Signed-off-by: Barry Song (Xiaomi) <baohua@xxxxxxxxxx>
> > ---
>
> Thanks Barry for sending out the RFC patch for discussion.
>
> Yes, we have indeed seen reports from our customers' scenarios where
> fragmentation caused kswapd to be woken up and reclaim too many file
> folios (even when free memory was sufficient), leading to severe I/O
> contention that impacted some applications.
>
> However, I'm concerned that this patch might also have side effects,
> such as affecting system defragmentation. In some scenarios, directly
> reclaiming clean pagecache to free up space might be a faster way to
balance_pgdat() can still reclaim clean page cache even when
pgdat_balanced() returns true, provided that nr_boost_reclaim is
non-zero.
		/*
		 * If boosting is not active then only reclaim if there are no
		 * eligible zones. Note that sc.reclaim_idx is not used as
		 * buffer_heads_over_limit may have adjusted it.
		 */
		if (!nr_boost_reclaim && balanced)
			goto out;

		/* Limit the priority of boosting to avoid reclaim writeback */
		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
			raise_priority = false;

		/*
		 * Do not writeback or swap pages for boosted reclaim. The
		 * intent is to relieve pressure not issue sub-optimal IO
		 * from reclaim context. If no pages are reclaimed, the
		 * reclaim will be aborted.
		 */
		sc.may_writepage = !nr_boost_reclaim;
		sc.may_swap = !nr_boost_reclaim;
I find that nr_boost_reclaim is almost always non-zero in bursty
network scenarios. So I guess clean page cache is still reclaimed,
but with much lower kswapd pressure.
> defragment. At the very least, I think under defrag_mode, we should be
> more aggressive about defragmentation (including reclaiming some memory
> by kswapd).
I guess we can keep the current behavior when defrag_mode prefers
over-reclaiming to form contiguous pages. Would a simple
if (defrag_mode) check be enough?
>
> > -RFC v1 was "mm: net: disable kswapd for high-order network
> > buffer allocation":
> > https://lore.kernel.org/linux-mm/20251013101636.69220-1-21cnbao@xxxxxxxxx/
> >
> > mm/vmscan.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index bd1b1aa12581..4f9668aa8eef 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6964,6 +6964,13 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
> > if (__zone_watermark_ok(zone, order, mark, highest_zoneidx,
> > 0, free_pages))
> > return true;
> > + /*
> > + * Free pages may be well above the watermark, but if
> > + * higher-order pages are unavailable, kswapd may still
> > + * trigger excessive reclamation.
> > + */
> > + if (order && compaction_suitable(zone, order, mark, highest_zoneidx))
> > + return true;
> > }
> >
> > /*
>
Thanks
Barry