Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling

From: Kairui Song

Date: Sat Apr 25 2026 - 09:30:20 EST

On Sat, Apr 25, 2026 at 8:18 PM Barry Song <baohua@xxxxxxxxxx> wrote:
>
> On Fri, Apr 24, 2026 at 8:56 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > Hi Barry,
> >
> > I ran your test script a few times, and strangely I can't reproduce
> > it. Swapniess behaves similarly after or before this series. I
> > directly checked out the mm-new commit of this series (4ce85c040e0a)
> > and compare to the mm-new commit right before this series
> > (31a112f05f62). I also extended your script a bit to test more
> > swappiness:
>
> Hi Kairui,
> I reset the repository to commit 4ce85c040e0a using
> git reset --hard, and I can still reproduce the
> swappiness issue. My machine is:
>
> barry@barry-desktop:~$ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 39 bits physical, 48 bits virtual
> Byte Order: Little Endian
> CPU(s): 20
> On-line CPU(s) list: 0-19
> Vendor ID: GenuineIntel
> Model name: Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz
> CPU family: 6
> Model: 165
> Thread(s) per core: 2
> Core(s) per socket: 10
> Socket(s): 1
> Stepping: 5
> CPU max MHz: 2800.0000
> CPU min MHz: 800.0000
> BogoMIPS: 5599.85
>
>
> swap is zRAM only:
> barry@barry-desktop:~$ cat /proc/swaps
> Filename Type Size Used Priority
> /dev/zram0 partition 12582908 280940 5
>
> The data is as below,
>
> *** Executing round 1 ***
> set swappiness to 35
>
> real 1m51.699s
> user 25m31.134s
> sys 4m13.127s
> pswpin: 1562949
> pswpout: 4840525
> pgpgin: 8751872
> pgpgout: 19741097
> swpout_zero: 1095783
> swpin_zero: 18079
> refault_file: 515292
> refault_anon: 1580980
>
> *** Executing round 2 ***
> set swappiness to 70
>
> real 1m51.603s
> user 25m33.600s
> sys 4m21.738s
> pswpin: 1786413
> pswpout: 5350804
> pgpgin: 8833652
> pgpgout: 21715596
> swpout_zero: 1230981
> swpin_zero: 21051
> refault_file: 313099
> refault_anon: 1807417
>
> *** Executing round 3 ***
> set swappiness to 105
>
> real 1m50.315s
> user 25m40.863s
> sys 4m12.446s
> pswpin: 1555289
> pswpout: 4911737
> pgpgin: 7597548
> pgpgout: 19956948
> swpout_zero: 1125969
> swpin_zero: 17594
> refault_file: 237475
> refault_anon: 1572835
>
> *** Executing round 4 ***
> set swappiness to 140
>
> real 1m50.992s
> user 25m34.774s
> sys 4m14.068s
> pswpin: 1642575
> pswpout: 5027730
> pgpgin: 7937214
> pgpgout: 20426400
> swpout_zero: 1155712
> swpin_zero: 20248
> refault_file: 215237
> refault_anon: 1662775
>
> *** Executing round 5 ***
> set swappiness to 175
>
> real 1m50.207s
> user 25m38.244s
> sys 4m7.655s
> pswpin: 1522633
> pswpout: 4788104
> pgpgin: 7307172
> pgpgout: 19464984
> swpout_zero: 1109281
> swpin_zero: 18085
> refault_file: 186203
> refault_anon: 1540669

Hmm, but reading the result you just posted, isn't swappiness actually
working as expected? Here is the data you just posted:

swappiness: 35 refault_file/anon: 515292 1580980
swappiness: 70 refault_file/anon: 313099 1807417
swappiness: 105 refault_file/anon: 237475 1572835
swappiness: 140 refault_file/anon: 215237 1662775
swappiness: 175 refault_file/anon: 186203 1540669

The higher the swappiness, the fewer file refaults we see.

> My bisect shows that the commit causing the swappiness issue is:
>
> [PATCH v6 05/14] mm/mglru: scan and count the exact number of folios
>
> Before that, swappiness behaves as expected, and there is
> also less swap-out/in activity(and much shorter sys time).
>
> *** Executing round 1 ***
> set swappiness to 35
>
> real 1m49.406s
> user 25m28.458s
> sys 3m41.098s
> pswpin: 984605
> pswpout: 3329809
> pgpgin: 5985696
> pgpgout: 13648560
> swpout_zero: 780136
> swpin_zero: 11379
> refault_file: 367629
> refault_anon: 995982
>
> *** Executing round 2 ***
> set swappiness to 70
>
> real 1m48.577s
> user 25m34.994s
> sys 3m42.694s
> pswpin: 985650
> pswpout: 3450097
> pgpgin: 5468828
> pgpgout: 14116020
> swpout_zero: 820143
> swpin_zero: 11808
> refault_file: 245353
> refault_anon: 997410
>
> *** Executing round 3 ***
> set swappiness to 105
>
> real 1m49.262s
> user 25m34.871s
> sys 3m41.633s
> pswpin: 998178
> pswpout: 3553741
> pgpgin: 5328896
> pgpgout: 14535068
> swpout_zero: 840706
> swpin_zero: 10393
> refault_file: 205514
> refault_anon: 1008525
>
> *** Executing round 4 ***
> set swappiness to 140
>
> real 1m49.417s
> user 25m35.395s
> sys 3m47.169s
> pswpin: 1138043
> pswpout: 3756034
> pgpgin: 5807584
> pgpgout: 15345816
> swpout_zero: 884539
> swpin_zero: 12652
> refault_file: 185767
> refault_anon: 1150649
>
> *** Executing round 5 ***
> set swappiness to 175
>
> real 1m49.654s
> user 25m35.244s
> sys 3m53.330s
> pswpin: 1235427
> pswpout: 4058085
> pgpgin: 6108792
> pgpgout: 16547764
> swpout_zero: 974086
> swpin_zero: 14280
> refault_file: 170452
> refault_anon: 1249705
>
> It’s too late today; I’ll continue debugging tomorrow.

Here is the data you posted from before that commit:
swappiness: 35 refault_file/anon: 367629 995982
swappiness: 70 refault_file/anon: 245353 997410
swappiness: 105 refault_file/anon: 205514 1008525
swappiness: 140 refault_file/anon: 185767 1150649
swappiness: 175 refault_file/anon: 170452 1249705

And after that commit:
swappiness: 35 refault_file/anon: 515292 1580980
swappiness: 70 refault_file/anon: 313099 1807417
swappiness: 105 refault_file/anon: 237475 1572835
swappiness: 140 refault_file/anon: 215237 1662775
swappiness: 175 refault_file/anon: 186203 1540669

So I think the problem is not swappiness itself; rather, there are more
anon refaults after that commit.

Before (swappiness 105):
pswpin: 998178
pswpout: 3553741
pgpgin: 5328896
pgpgout: 14535068

After (swappiness 105):
pswpin: 1555289
pswpout: 4911737
pgpgin: 7597548
pgpgout: 19956948
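
To make the comparison above easier to repeat, here is a minimal
userspace sketch that only does arithmetic on the refault counters
quoted in this thread. The struct and helper names (refault_sample,
pct_change, file_refaults_decrease) are made up for illustration and
are not kernel code.

```c
#include <stddef.h>

/* Refault counters copied from the two result sets quoted in this thread. */
struct refault_sample {
	int swappiness;
	unsigned long file;
	unsigned long anon;
};

/* Before "[PATCH v6 05/14] mm/mglru: scan and count the exact number of folios". */
static const struct refault_sample before_commit[] = {
	{  35, 367629,  995982 },
	{  70, 245353,  997410 },
	{ 105, 205514, 1008525 },
	{ 140, 185767, 1150649 },
	{ 175, 170452, 1249705 },
};

/* After that commit. */
static const struct refault_sample after_commit[] = {
	{  35, 515292, 1580980 },
	{  70, 313099, 1807417 },
	{ 105, 237475, 1572835 },
	{ 140, 215237, 1662775 },
	{ 175, 186203, 1540669 },
};

/* Relative change from prev to cur, in percent (hypothetical helper). */
static double pct_change(unsigned long prev, unsigned long cur)
{
	return ((double)cur - (double)prev) * 100.0 / (double)prev;
}

/* Return 1 if file refaults strictly decrease as swappiness goes up. */
static int file_refaults_decrease(const struct refault_sample *s, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (s[i].file >= s[i - 1].file)
			return 0;
	}
	return 1;
}
```

file_refaults_decrease() holds for both tables, which is the
"swappiness still works" point, while pct_change() on the anon column
is positive for every swappiness value, which is the actual difference.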

I just ran a matrix of four kernels (mainline, mm-new HEAD, before this
series, after this series) x 3 different memcg configs (-j96 3G, -j48
2G, -j24 1G). None of them showed any regression, only improvements.
That's really odd.

One possibility is that I removed this check:

if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
scanned = 0;

Removing it makes the reclaim loop go further and trigger aging.
Previously, if reclaim drained the LRU's cold gens, it could go on to
reclaim slab instead. Idle inodes would then be dropped together with
their mappings, reclaiming more file pages, and we wouldn't see any
refault data from that since the mapping itself is gone. Sys would be
lower too, as IO isn't counted as sys. Checking your data, although sys
is higher, real is actually lower, which matches my guess.
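
For reference, the condition in question can be modelled in userspace
roughly as below. This is a simplified sketch, not the kernel helper:
the model_* names are made up, and the model always takes the older of
the anon and file min_seq, whereas the real evictable_min_seq() also
takes swappiness into account. Only MIN_NR_GENS (2) is taken from the
kernel.

```c
#include <stdbool.h>

#define MIN_NR_GENS 2UL	/* same value as the kernel's MIN_NR_GENS */

enum { MODEL_ANON, MODEL_FILE, MODEL_NR_TYPES };

/*
 * Simplified stand-in for evictable_min_seq(): the oldest generation
 * across both types. The real helper restricts this to the types that
 * are evictable for the given swappiness.
 */
static unsigned long model_evictable_min_seq(const unsigned long min_seq[MODEL_NR_TYPES])
{
	return min_seq[MODEL_ANON] < min_seq[MODEL_FILE] ?
	       min_seq[MODEL_ANON] : min_seq[MODEL_FILE];
}

/*
 * True when no more than MIN_NR_GENS generations remain, i.e. there is
 * no cold generation left to evict without running aging first.
 */
static bool model_out_of_cold_folios(const unsigned long min_seq[MODEL_NR_TYPES],
				     unsigned long max_seq)
{
	return model_evictable_min_seq(min_seq) + MIN_NR_GENS > max_seq;
}
```

With max_seq = 5, the model reports "out of cold folios" once both
min_seq values have reached 4, and not while either one is still at 3.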

Will the following patch help? I'm not sure this is the problem, but
it adds back that early abort. Personally I don't think the check makes
much sense, as it's more of a workaround for other issues, but if it
helps we had better keep it.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 30a89224117b..c1e7c65ff3b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4837,6 +4837,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
int scanned, reclaimed;
int isolated = 0, type, type_scanned;
bool skip_retry = false;
+ struct lru_gen_folio *lrugen = &lruvec->lrugen;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);

@@ -4852,6 +4853,10 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
if (scanned)
try_to_inc_min_seq(lruvec, swappiness);

+ /* Out of cold folios, return 0 to abort early and also trigger shrinkers beside LRU */
+ if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
+ scanned = 0;
+
lruvec_unlock_irq(lruvec);

if (list_empty(&list))

But this could cause early OOM; we have observed that several times
due to the early return. So maybe we should check sc->priority too, or
move this to should_abort_scan?

Or perhaps we should just restore the behavior of never running aging
at DEF_PRIORITY, which seems better and safer, as below:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 30a89224117b..2080522ea924 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4917,14 +4917,14 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
{
DEFINE_MIN_SEQ(lruvec);

- /* have to run aging, since eviction is not possible anymore */
- if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
- return true;
-
/* try to avoid aging, do gentle reclaim at the default priority */
if (sc->priority == DEF_PRIORITY)
return false;

+ /* have to run aging, since eviction is not possible anymore */
+ if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
+ return true;
+
/* better to run aging even though eviction is still possible */
return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
}
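
To show what the reordering changes, here is a small userspace model of
the two orderings. It is only a sketch: the aging_* function names are
made up, min_seq stands for the result of evictable_min_seq(), and the
real should_run_aging() takes more arguments. MIN_NR_GENS (2) and
DEF_PRIORITY (12) match the kernel values.

```c
#include <stdbool.h>

#define MIN_NR_GENS  2UL	/* same value as in the kernel */
#define DEF_PRIORITY 12	/* same value as in the kernel */

/* Ordering with this series: the "eviction impossible" check comes first. */
static bool aging_check_first(unsigned long min_seq, unsigned long max_seq,
			      int priority)
{
	/* have to run aging, since eviction is not possible anymore */
	if (min_seq + MIN_NR_GENS > max_seq)
		return true;
	/* try to avoid aging at the default priority */
	if (priority == DEF_PRIORITY)
		return false;
	return min_seq + MIN_NR_GENS == max_seq;
}

/* Ordering in the patch above: the priority check comes first. */
static bool aging_priority_first(unsigned long min_seq, unsigned long max_seq,
				 int priority)
{
	/* never age at the default priority, even if nothing is evictable */
	if (priority == DEF_PRIORITY)
		return false;
	if (min_seq + MIN_NR_GENS > max_seq)
		return true;
	return min_seq + MIN_NR_GENS == max_seq;
}
```

The two only disagree when priority == DEF_PRIORITY and eviction is no
longer possible (for example min_seq = 4, max_seq = 5): the first
variant ages, the second does not. In every other case they return the
same value.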

I'll keep testing with more filesystems and setups. Which FS are you
using? This might be related to FS-side reclaim as well, if it's caused
by shrinker balance.

> [...]
> > Since you mentioned it's mm-new vs mainline, and you have reverted
> > part of this series and the problem is still there. Could it be
> > related to something else in mm-new? I'll keep testing more stress and
> > workload to dig deeper too. Or maybe the swappiness behavior just
> > changed slightly, some it may perform better or worse depending on
> > timing and workload? Swappiness on MGLRU currently only works as a
> > factor for calculating the refault and reclaim balance of anon / file
> > so it may behave a bit unpredictable. There isn't a proportional
> > calculation like active / inactive LRU. That's a problem too, and we
> > might fix that later.
>
> read_ctrl_pos() should also bias towards swappiness, as
> both sp and pv gains are affected by it. Yes, we need to
> fix the swappiness for mglru.

Yes, read_ctrl_pos is the helper for calculating the refault and
reclaim balance that I was talking about.