Re: [PATCH v6 04/14] mm/mglru: restructure the reclaim loop

From: Kairui Song

Date: Fri Apr 24 2026 - 13:04:24 EST


Forwarding sashiko review:

> commit b0a64f8c3f76ce5cebb30c1caa6305fc617888cf
> Author: Kairui Song <kasong@xxxxxxxxxxx>
> mm/mglru: restructure the reclaim loop
> The current loop calculates the scan number on each iteration, which couples
> it with aging and rotation. This patch decouples aging and rotation by
> calculating the scan number once at the beginning of reclaim, respecting
> reclaim priority, and making aging explicit.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3c5a6ae9244099..757beb605980d4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> [ ... ]
> > @@ -4940,11 +4932,10 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> > return nr_to_scan;
> >
> > nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> > + nr_to_scan >>= sc->priority;
> >
> > - /* try to get away with not aging at the default priority */
> > - if (!need_aging || sc->priority == DEF_PRIORITY)
> > - return nr_to_scan >> sc->priority;
> > + if (!nr_to_scan && sc->priority < DEF_PRIORITY)
> > + nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
> Does this code introduce a priority inversion where higher reclaim priorities
> result in fewer pages being scanned than lower priorities?
> If a small memcg has an evictable size of 100 and SWAP_CLUSTER_MAX is 32,
> at sc->priority == 7, 100 >> 7 is 0, so the fallback sets nr_to_scan to 32.
> At sc->priority == 6 (higher urgency), 100 >> 6 is 1. The fallback is
> bypassed and nr_to_scan remains 1.
> Would it be better for this fallback to apply a floor regardless of whether
> the shifted result is exactly zero?

Right, sashiko has flagged this part many times, and it wasn't happy
with any of the designs (unconditional floor, no handling at all, or
this version). But this is not a functional regression.

For tiny cgroups, the old code before this patch could over-reclaim.
That over-reclaim is now gone, but a trivial priority inversion is
possible instead, which I don't think is a practically observable problem.

But to be academically ideal, the following change would be better:

	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
	nr_to_scan >>= sc->priority;

	/* make escalated reclaim progress and balance with slab shrinking */
	if (sc->priority < DEF_PRIORITY)
		nr_to_scan = max(nr_to_scan, min(evictable, SWAP_CLUSTER_MAX));

No inversion, and no under-reclaim or over-reclaim.
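
To make the arithmetic concrete, here is a minimal userspace sketch (not
kernel code) comparing the zero-only fallback from this version with the
max()-based floor. The helper names scan_v6(), scan_floor(), min_long() and
max_long() are hypothetical and exist only for this illustration.
apply_proportional_protection() is assumed to pass the value through
unchanged, and DEF_PRIORITY = 12 and SWAP_CLUSTER_MAX = 32 are assumed to
match the usual kernel definitions.

```c
/* Userspace illustration only; assumed values, see the note above. */
#define DEF_PRIORITY		12
#define SWAP_CLUSTER_MAX	32L

static long min_long(long a, long b) { return a < b ? a : b; }
static long max_long(long a, long b) { return a > b ? a : b; }

/* Fallback as in this version: applied only when the shift yields zero. */
static long scan_v6(long evictable, int priority)
{
	long nr_to_scan = evictable >> priority;

	if (!nr_to_scan && priority < DEF_PRIORITY)
		nr_to_scan = min_long(evictable, SWAP_CLUSTER_MAX);

	return nr_to_scan;
}

/* Proposed floor: applied whenever the priority has been escalated. */
static long scan_floor(long evictable, int priority)
{
	long nr_to_scan = evictable >> priority;

	if (priority < DEF_PRIORITY)
		nr_to_scan = max_long(nr_to_scan,
				      min_long(evictable, SWAP_CLUSTER_MAX));

	return nr_to_scan;
}
```

With evictable = 100, scan_v6() returns 32 at priority 7 but only 1 at
priority 6, which is the inversion sashiko describes. scan_floor() returns
32 at both priorities and still returns 0 at DEF_PRIORITY.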

> [ ... ]
> > @@ -4985,31 +4977,44 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> > return true;
> > }
> >
> > +/*
> > + * For future optimizations:
> > + * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
> > + * reclaim.
> > + */
> > static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > {
> > + bool need_rotate = false;
> > long nr_batch, nr_to_scan;
> > - unsigned long scanned = 0;
> > int swappiness = get_swappiness(lruvec, sc);
> > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >
> > - while (true) {
> > + nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
> > + while (nr_to_scan > 0) {
> > int delta;
> > + DEFINE_MAX_SEQ(lruvec);
> >
> > - nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
> > - if (nr_to_scan <= 0)
> > + if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
> > + need_rotate = true;
> > break;
> > + }
> > +
> > + if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
> > + if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
> > + need_rotate = true;
> > + /* stop scanning as it's low on cold folios */
> > + break;
> > + }
> Can this bypass critical memcg protection checks and aging when nr_to_scan
> is 0?
> If get_nr_to_scan() returns 0 (which happens if evictable == 0 or when
> nr_to_scan is shifted to 0 at DEF_PRIORITY), the loop body is bypassed
> entirely.
> This means mem_cgroup_below_min() is never evaluated in those cases,
> preventing need_rotate from being set to true for protected memcgs.
> The reclaimer relies on this flag to rotate away from protected memcgs.
> Additionally, does this skip aging for empty offline memcgs
> (where evictable == 0), since they never enter the loop where
> should_run_aging() is called?
> The commit message states that offline memcgs should now be aged
> to enable further reclaim, but this bypass seems to prevent that.

Pretty much a false positive. Bypassing the loop body is not a problem
at DEF_PRIORITY, as there was no aging there before either. The rotation
for below-min memcgs is already handled by shrink_one() before entering
the loop in the global reclaim path, and it is inconsequential in the
memcg reclaim path.

And for an offline memcg, nr_to_scan is never zero unless that memcg is
completely empty. If a lot of empty memcgs end up slowing down the
memcg iterator due to missing rotation, that's a different issue,
neither related to nor introduced by this patch.