Re: [PATCH v2 1/5] mm: Introduce zone_appears_fragmented()

From: Thomas Hellström

Date: Fri Apr 24 2026 - 03:08:57 EST


On Thu, 2026-04-23 at 15:21 -0700, Matthew Brost wrote:
> On Thu, Apr 23, 2026 at 12:08:36PM -0700, Matthew Brost wrote:
> > On Thu, Apr 23, 2026 at 01:27:11PM +0200, Thomas Hellström wrote:
> > > On Thu, 2026-04-23 at 12:27 +0200, David Hildenbrand (Arm) wrote:
> > > > On 4/23/26 07:56, Matthew Brost wrote:
> > > > > Introduce zone_appears_fragmented() as a lightweight helper to
> > > > > allow subsystems to make coarse decisions about reclaim
> > > > > behavior in the presence of likely fragmentation.
> > > > >
> > > > > The helper implements a simple heuristic: if the number of free
> > > > > pages in a zone exceeds twice the high watermark, the zone is
> > > > > considered to have ample free memory and allocation failures
> > > > > are more likely due to fragmentation than overall memory
> > > > > pressure.
> > > > >
> > > > > This is intentionally imprecise and is not meant to replace the
> > > > > core MM compaction or fragmentation accounting logic. Instead,
> > > > > it provides a cheap signal for callers (e.g., shrinkers) that
> > > > > wish to avoid overly aggressive reclaim when sufficient free
> > > > > memory exists but high-order allocations may still fail.
> > > > >
> > > > > No functional changes; this is a preparatory helper for future
> > > > > users.
> > > > >
> > > > > Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > > > > Cc: David Hildenbrand <david@xxxxxxxxxx>
> > > > > Cc: Lorenzo Stoakes <ljs@xxxxxxxxxx>
> > > > > Cc: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>
> > > > > Cc: Vlastimil Babka <vbabka@xxxxxxxxxx>
> > > > > Cc: Mike Rapoport <rppt@xxxxxxxxxx>
> > > > > Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > > > > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > > > > Cc: linux-mm@xxxxxxxxx
> > > > > Cc: linux-kernel@xxxxxxxxxxxxxxx
> > > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > > ---
> > > > >  include/linux/vmstat.h | 13 +++++++++++++
> > > > >  1 file changed, 13 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > > > > index 3c9c266cf782..568d9f4f1a1f 100644
> > > > > --- a/include/linux/vmstat.h
> > > > > +++ b/include/linux/vmstat.h
> > > > > @@ -483,6 +483,19 @@ static inline const char *zone_stat_name(enum zone_stat_item item)
> > > > >   return vmstat_text[item];
> > > > >  }
> > > > >  
> > > > > +static inline bool zone_appears_fragmented(struct zone *zone)
> > > > > +{
> > > >
> > > > "zone_likely_fragmented" or "zone_maybe_fragmented" might be
> > > > clearer, depending on the actual semantics.
> > > >
> > > > > + /*
> > > > > + * Simple heuristic: if the number of free pages is more than twice the
> > > > > + * high watermark, this strongly suggests that the zone is heavily
> > > > > + * fragmented when called from a shrinker.
> > > > > + */
> > > >
> > > > I'll cc some more people. But the "when called from a shrinker"
> > > > bit is concerning. Are there additional semantics that should be
> > > > expressed in the function name, for example?
> > > >
> > > > Something that implies that this function only gives you a
> > > > reasonable answer in a certain context.
> > >
> > > I think that test would not be relevant for cgroup-aware
> > > shrinking.
> > >
> > > What about trying to pass something in the struct shrink_control?
> > > Like
> > > if we pass the struct scan_control's order field also in struct
> >
> > If the order were included in shrink_control, there is about a 95%
> > chance that this change would allow TTM / Xe to break the
> > problematic
> > kswapd feedback loop. This may also better express the intent of
> > the
> > problem we are trying to fix here.
> >
> > For reference, the cover letter [1] details the problem.
> >
> > Any guidance from the core MM folks would be appreciated—would
> > adding
> > the order to shrink_control be an acceptable solution?
> >
> > Matt
> >
> > [1] https://patchwork.freedesktop.org/series/165330/
> >
> > > shrink_control, really expensive shrinkers could duck reclaim
> > > attempts
> > > from higher-order allocations that may fail anyway:
> > >
> > >       if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> > >            (sc->gfp_mask & (__GFP_NORETRY | __GFP_RETRY_MAYFAIL)) &&
> > >            !(sc->gfp_mask & __GFP_NOFAIL))
>
> It doesn't look like __GFP_NORETRY, __GFP_RETRY_MAYFAIL, __GFP_NOFAIL
> make it to the sc->gfp_mask flags from the caller and get into kswapd
> loop...

Perhaps that's because they mostly (only?) make sense from direct
reclaim? Looks like the trace is from kswapd.

Another metric to weigh in is perhaps the scan_control::priority field.
From my understanding it is progressively decreased towards 0 with 0
indicating most urgent shrinking.
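
To make that concrete, here is a rough userspace sketch of how an
expensive shrinker might combine the two signals. Everything prefixed
mock_/MOCK_ is hypothetical: shrink_control does not carry order or
priority today, the flag bits are stand-ins rather than the real GFP
values, and the urgency cut-off of 2 is an arbitrary number picked for
illustration. Only PAGE_ALLOC_COSTLY_ORDER and DEF_PRIORITY are meant
to mirror the kernel's definitions as I understand them.

```c
#include <assert.h>
#include <stdbool.h>

/* Intended to mirror the kernel's values (3 and 12 respectively). */
#define PAGE_ALLOC_COSTLY_ORDER 3
#define DEF_PRIORITY 12

/* Stand-in flag bits for this sketch only; NOT the kernel's GFP bits. */
#define MOCK_GFP_NORETRY       (1u << 0)
#define MOCK_GFP_RETRY_MAYFAIL (1u << 1)
#define MOCK_GFP_NOFAIL        (1u << 2)

/* Hypothetical cut-off: at or below this priority, stop ducking. */
#define MOCK_URGENT_PRIORITY 2

/* Hypothetical: the real shrink_control has no order/priority fields. */
struct mock_shrink_control {
	unsigned int gfp_mask;
	int order;
	int priority;	/* DEF_PRIORITY down to 0, 0 being most urgent */
};

/*
 * Return true if an expensive shrinker could reasonably skip this
 * reclaim attempt (the real callback would return SHRINK_STOP).
 */
static bool mock_shrinker_should_duck(const struct mock_shrink_control *sc)
{
	assert(sc->priority >= 0 && sc->priority <= DEF_PRIORITY);

	/* Cheap orders are expected to succeed; always help out. */
	if (sc->order <= PAGE_ALLOC_COSTLY_ORDER)
		return false;

	/* The allocation must not fail, so never duck. */
	if (sc->gfp_mask & MOCK_GFP_NOFAIL)
		return false;

	/* The caller has a fallback: the original direct-reclaim test. */
	if (sc->gfp_mask & (MOCK_GFP_NORETRY | MOCK_GFP_RETRY_MAYFAIL))
		return true;

	/* kswapd-like case with no retry flags: duck until it is urgent. */
	return sc->priority > MOCK_URGENT_PRIORITY;
}
```

With the values from the trace (order=9, no retry flags) this would
duck at DEF_PRIORITY and start shrinking again once priority drops to
the cut-off. Whether priority is a reasonable proxy here is exactly
the part that needs core MM input.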

Thanks,
Thomas

>
> [  394.049058] xe_shrinker_scan: no skip order=9,
> gfp=0x0000000000000cc0
> [  394.049061] CPU: 2 UID: 0 PID: 110 Comm: kswapd0 Not tainted
> 7.0.0-xe+ #355 PREEMPT(full)
> [  394.049062] Hardware name: Intel Corporation Panther Lake
> Client Platform/PTL-UH LP5 T3 RVP1, BIOS
> PTLPFWI1.R00.3332.D05.2509011438 09/01/2025
> [  394.049063] Call Trace:
> [  394.049065]  <TASK>
> [  394.049066]  dump_stack_lvl+0x55/0x70
> [  394.049073]  xe_shrinker_scan+0x274/0x280 [xe]
> [  394.049181]  do_shrink_slab+0x132/0x360
> [  394.049184]  shrink_slab+0xf0/0x3e0
> [  394.049186]  shrink_node+0x2bd/0x800
> [  394.049188]  balance_pgdat+0x323/0x760
> [  394.049189]  kswapd+0x1c3/0x340
> [  394.049190]  ? __pfx_autoremove_wake_function+0x10/0x10
> [  394.049193]  ? __pfx_kswapd+0x10/0x10
> [  394.049194]  kthread+0xdf/0x120
> [  394.049196]  ? __pfx_kthread+0x10/0x10
> [  394.049197]  ret_from_fork+0x1d0/0x220
> [  394.049200]  ? __pfx_kthread+0x10/0x10
> [  394.049200]  ret_from_fork_asm+0x1a/0x30
> [  394.049202]  </TASK>
>
> Will look into whether this is fixable, but again any core MM
> guidance would be helpful.
>
> Matt
>
> > >            return SHRINK_STOP;
> > >
> > > Possibly exposed as an inline helper in the shrinker interface?
> > >
> > > /Thomas