Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
From: Kairui Song
Date: Thu May 14 2026 - 14:50:52 EST
On Tue, Apr 28, 2026 at 2:07 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@xxxxxxxxxx> wrote:
>
> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
>
> Some of the problems were found in our production environment, and
> others were mostly exposed while stress testing during the development
> of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
> the code base and fixes several performance issues, preparing for
> further work.
>
> MGLRU's reclaim loop is a bit complex, and hence these problems are
> somehow related to each other. The aging, scan number calculation, and
> reclaim loop are coupled together, and the dirty folio handling logic is
> quite different, making the reclaim loop hard to follow and the dirty
> flush ineffective.
>
> This series slightly cleans up and improves these issues using a scan
> budget by calculating the number of folios to scan at the beginning of
> the loop, and decouples aging from the reclaim calculation helpers.
> Then, move the dirty flush logic inside the reclaim loop so it can kick
> in more effectively. These issues are somehow related, and this series
> handles them and improves MGLRU reclaim in many ways.
>
> Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> and a 128G memory machine using NVME as storage.
Hi All,
This is a supplementary test report that also explains why we are
using these cases. All tests below, unless explicitly declared
otherwise, are run at least six times, using the median result. Some
tests are also run against MGLRU-FG[1] as a reference. (MGLRU-FG is
still under development so the readings may change - hopefully for the
better).
>
> MongoDB
> =======
> Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> threads:32), which does 95% read and 5% update to generate mixed read
> and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> the WiredTiger cache size is set to 4.5G, using NVME as storage.
MongoDB with workloadb has mixed writeback and read pressure, which
tests the LRU's capability to handle writeback flushing while
protecting the workingset.
Using the same test setup, I retested everything with Classical LRU
included (which I will refer to as CLRU below, open to suggestions for
a better abbreviation :) ). I rebased it on top of the current 7.1 rc
with a clean test environment.
CLRU:
93713.640901 ops/sec
workingset_refault_file 15013443
pgpgin 85365614
pgpgout 5866508
MGLRU Before:
60653.502655 ops/sec
workingset_refault_file 12904916
pgpgin 165366622
pgpgout 5219588
MGLRU After:
82384.354760 ops/sec
workingset_refault_file 7128285
pgpgin 113170693
pgpgout 5639724
Before this series, MGLRU lagged CLRU by approximately 35% on this
workload. This is the case where MGLRU has historically struggled the
most. This series closes most of that gap (within ~13%), and
MGLRU-FG[1] will close the rest (within noise). The trajectory is
clear and the work is ongoing:
MGLRU-FG:
92930.697550 ops/sec
workingset_refault_file 10775748
pgpgin 98558215
pgpgout 5736764
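As a sanity check on the gap percentages quoted above, here is a
minimal Python sketch that recomputes them from the ops/sec readings.
Only the numbers come from the runs above; the dictionary layout and
the helper name gap_vs_baseline are just illustrative choices.

```python
# Throughput readings (ops/sec) copied from the MongoDB runs above.
ops = {
    "CLRU": 93713.640901,
    "MGLRU before": 60653.502655,
    "MGLRU after": 82384.354760,
    "MGLRU-FG": 92930.697550,
}


def gap_vs_baseline(value, baseline):
    """Return how far `value` lags `baseline`, as a percentage of baseline."""
    return (baseline - value) / baseline * 100.0


# Gap of every MGLRU variant relative to CLRU (hypothetical helper usage).
gaps = {
    name: gap_vs_baseline(value, ops["CLRU"])
    for name, value in ops.items()
    if name != "CLRU"
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f}% behind CLRU")
```

This gives roughly 35% for MGLRU before, roughly 12% for MGLRU after,
and under 1% for MGLRU-FG.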
It's very interesting that MGLRU-FG and CLRU both have a higher
workingset_refault_file but a lower pgpgin than MGLRU after this
series. I suspect this could be related to the slab (inode) shrinking
balance. I ran into a similar issue before [2]. It can be looked into
later, but I think it's irrelevant to this series and we are
definitely on the right track.
The test results above basically match the cover letter as well. The
readings are a bit different due to a different test environment,
which isn't an issue, so I think there is no need to update that.
>
> Not using SWAP.
>
> Before:
> Throughput(ops/sec): 62485.02962831822
> AverageLatency(us): 500.9746963330107
> pgpgin 159347462
> pgpgout 5413332
> workingset_refault_anon 0
> workingset_refault_file 34522071
>
> After:
> Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> pgpgin 111093923 (-30.3%, lower is better)
> pgpgout 5437456
> workingset_refault_anon 0
> workingset_refault_file 19566366 (-43.3%, lower is better)
>
> We can see a significant performance improvement after this series.
> The test is done on NVME and the performance gap would be even larger
> for slow devices, such as HDD or network storage. We observed over
> 100% gain for some workloads with slow IO.
>
> Chrome & Node.js [3]
> ====================
> Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> workers:
>
> Before:
> Total requests: 79915
> Per-worker 95% CI (mean): [1233.9, 1263.5]
> Per-worker stdev: 59.2
> Jain's fairness: 0.997795 (1.0 = perfectly fair)
> Latency:
> Bucket Count Pct Cumul
> [0,1)s 26859 33.61% 33.61%
> [1,2)s 7818 9.78% 43.39%
> [2,4)s 5532 6.92% 50.31%
> [4,8)s 39706 49.69% 100.00%
>
> After:
> Total requests: 81382
> Per-worker 95% CI (mean): [1241.9, 1301.3]
> Per-worker stdev: 118.8
> Jain's fairness: 0.991480 (1.0 = perfectly fair)
> Latency:
> Bucket Count Pct Cumul
> [0,1)s 26696 32.80% 32.80%
> [1,2)s 8745 10.75% 43.55%
> [2,4)s 6865 8.44% 51.98%
> [4,8)s 39076 48.02% 100.00%
>
> Reclaim is still fair and effective, total requests number seems
> slightly better.
Chrome & Node.js is a very common workload for many users. Running
these workloads in different cgroups applies equal pressure to all
cgroups under global pressure, hence testing the LRU's ability to
detect and protect the working set, its efficiency, and how well it
balances reclaim between multiple tenants.
I'll post only a summary of the test results since the raw output is
way too long.
CLRU:
THROUGHPUT
Total requests: 62399
Per-worker mean: 975.0
Per-worker 95% CI (mean): [ 941.9, 1008.1]
LATENCY DISTRIBUTION (all workers aggregated)
Bucket Count Pct Cumul
[0,1)s 20051 32.13% 32.13%
[1,2)s 2255 3.61% 35.75%
[2,4)s 6149 9.85% 45.60%
[4,8)s 33927 54.37% 99.97%
[8,16)s 17 0.03% 100.00%
FAIRNESS (per-worker total requests)
Jain's fairness index: 0.982156 (1.0 = perfectly fair)
MGLRU before:
THROUGHPUT
Total requests: 81898
Per-worker mean: 1279.7
Per-worker 95% CI (mean): [ 1259.0, 1300.4]
LATENCY DISTRIBUTION (all workers aggregated)
Bucket Count Pct Cumul
[0,1)s 28392 34.67% 34.67%
[1,2)s 8022 9.80% 44.46%
[2,4)s 6130 7.48% 51.95%
[4,8)s 39354 48.05% 100.00%
FAIRNESS (per-worker total requests)
Jain's fairness index: 0.995893 (1.0 = perfectly fair)
MGLRU after:
THROUGHPUT
Total requests: 82901
Per-worker mean: 1295.3
Per-worker 95% CI (mean): [ 1265.3, 1325.4]
LATENCY DISTRIBUTION (all workers aggregated)
Bucket Count Pct Cumul
[0,1)s 28128 33.93% 33.93%
[1,2)s 8756 10.56% 44.49%
[2,4)s 7028 8.48% 52.97%
[4,8)s 38989 47.03% 100.00%
FAIRNESS (per-worker total requests)
Jain's fairness index: 0.991607 (1.0 = perfectly fair)
In summary, MGLRU performs very well, both before and after this
series, across throughput, latency, and fairness. I also tested
MGLRU-FG, which yielded similar results with a per-worker 95% CI
(mean) of [1275.5, 1333.9].
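For readers unfamiliar with the fairness number in the summaries
above, Jain's fairness index over n per-worker totals x_i is
(sum x_i)^2 / (n * sum x_i^2). Below is a minimal Python sketch of
that formula. The function name is mine, and the per-worker counts in
the example are hypothetical since the raw per-worker data is not
posted here; this is not the actual test script.

```python
def jain_fairness(samples):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(samples)
    total = sum(samples)
    squares = sum(x * x for x in samples)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero sample")
    return (total * total) / (n * squares)


# Hypothetical per-worker request counts, for illustration only.
example = [1290, 1310, 1275, 1305]
print(f"Jain's fairness index: {jain_fairness(example):.6f}")
```

The index is 1.0 when every worker completes the same number of
requests and drops towards 1/n when a single worker gets everything.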
>
> OOM issue with aging and throttling
> ===================================
> For the throttling OOM issue, it can be easily reproduced using dd and
> cgroup limit as demonstrated and fixed by a later patch in this series.
Skipping this one: the aging/throttling OOM is an MGLRU-only issue,
and it is fixed by this series.
>
> MySQL:
> ======
>
> Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> ZRAM as swap and test command:
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
> --tables=48 --table-size=2000000 --threads=48 --time=600 run
>
> Before: 17303.41 tps
> After this series: 17291.50 tps
>
MySQL with sysbench is a standard database benchmark. The 24G InnoDB
buffer pool inside a 2G memory cgroup forces aggressive eviction of
cached database anon pages, testing the LRU's ability to identify hot
pages and the eviction path's efficiency under swap pressure.
Here is the retested result (average of 6 test runs):
CLRU: 16245.330000 tps
MGLRU before: 17313.688333 tps
MGLRU after: 17286.195000 tps
MGLRU-FG: 17225.123333 tps
So MGLRU before/after/FG are all doing well with this one, ahead of
CLRU. It seems very slightly slower after this series, but this could
be noise, and I think it's fine to ignore. This series has no
noticeable effect on MGLRU for this kind of test.
>
> FIO:
> ====
> Testing with the following command, where /mnt/ramdisk is a
> 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> 6 test run each:
>
> fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
> --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
> --rw=randread --norandommap --time_based \
> --ramp_time=1m --runtime=5m --group_reporting
>
> Before: 8968.76 MB/s
> After this series: 8995.63 MB/s
Random buffered FIO read on a ramdisk basically tests the LRU's
ability to evict the page cache efficiently. Results (average of 6
test runs):
CLRU: 8254.540000 MB/s
MGLRU before: 9033.908333 MB/s
MGLRU after: 9065.725000 MB/s
MGLRU-FG: 9067.105000 MB/s
MGLRU before / after / FG are all doing very well on this one.
>
> Also seem only noise level changes and no regression or slightly better.
>
> Build kernel:
> =============
> Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> using make -j96 and defconfig, measuring system time, 12 test run each.
>
> Before: 2873.52s
> After this series: 2811.88s
The kernel build test is a classic test for us. Many performance
tests use it as a standard; it is a real workload and is
representative of many compilation tasks.
I'll just post the system time for a more direct comparison, following
the setup as described in the cover letter:
CLRU: 2760.50user 5023.50system 1:51.89elapsed
MGLRU before: 2924.41user 2823.13system 1:28.09elapsed
MGLRU after: 2938.42user 2801.26system 1:28.10elapsed
MGLRU-FG: 2936.42user 2781.65system 1:27.84elapsed
MGLRU before / after / FG are all doing very well on this one.
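To make the system time difference on tmpfs easier to read, here is a
small Python sketch that extracts the "system" field from the lines
above and reports the reduction relative to CLRU. The parsing regex
and the names are my own assumptions about the output format shown
here, not part of the test setup.

```python
import re

# Timing lines copied from the tmpfs kernel build results above.
results = {
    "CLRU": "2760.50user 5023.50system 1:51.89elapsed",
    "MGLRU before": "2924.41user 2823.13system 1:28.09elapsed",
    "MGLRU after": "2938.42user 2801.26system 1:28.10elapsed",
    "MGLRU-FG": "2936.42user 2781.65system 1:27.84elapsed",
}


def system_seconds(line):
    """Extract the system time in seconds from a '<n>user <n>system ...' line."""
    match = re.search(r"([\d.]+)system", line)
    if match is None:
        raise ValueError(f"no system time found in: {line!r}")
    return float(match.group(1))


baseline = system_seconds(results["CLRU"])
savings = {}
for name, line in results.items():
    sys_s = system_seconds(line)
    # Percentage of system time saved relative to the CLRU run.
    savings[name] = (baseline - sys_s) / baseline * 100.0
    print(f"{name}: {sys_s:.2f}s system, {savings[name]:.1f}% less than CLRU")
```

All three MGLRU variants come out at roughly 44% less system time than
CLRU in this run.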
Testing on disk instead using BTRFS with a 3G memcg:
CLRU:
real 1m51.325s
user 37m16.586s
sys 11m20.294s
MGLRU before:
real 1m49.649s
user 37m38.325s
sys 9m0.360s
MGLRU after:
real 1m49.223s
user 37m15.546s
sys 8m46.135s
MGLRU-FG:
real 1m49.908s
user 37m22.696s
sys 8m53.138s
Still, MGLRU before / after / FG are all doing very well.
>
> Also seem only noise level changes, no regression or very slightly better.
>
> Android:
> ========
> Xinyu reported a performance gain on Android, too, with this series. The
> test consisted of cold-starting multiple applications sequentially under
> moderate system load. [6]
>
> Before:
> Launch Time Summary (all apps, all runs)
> Mean 868.0ms
> P50 888.0ms
> P90 1274.2ms
> P95 1399.0ms
>
> After:
> Launch Time Summary (all apps, all runs)
> Mean 850.5ms (-2.07%)
> P50 861.5ms (-3.04%)
> P90 1179.0ms (-8.05%)
> P95 1228.0ms (-12.2%)
I've seen many reports from Android that MGLRU provides better battery
life and my personal experience backporting this series on my phone is
quite positive :). I currently lack a standard environment for Android
testing because I don't have any Android vendor support so I'll have
to skip the comparison on this one. And I think Xinyu's original
numbers are good enough for this series. (The community reports and
historical reports at LPC or LSF/MM/BPF that I remember seeing all
look good so far).
I've also posted other tests previously that all show this series is
behaving correctly, but I don't think we should include all of them or
this will be ridiculously long:
https://lore.kernel.org/linux-mm/CAMgjq7BzQAPp8u_3-9e3ueXmRCoW=2sydok0hFM=MYL7VC1YYg@xxxxxxxxxxxxxx/
======
In summary, I think these tests make a lot of sense, and re-testing
with CLRU in a row indicates that MGLRU performs very well with this
series. In most cases MGLRU performs much better. MGLRU suffered the
most during the MongoDB writeback workload (YCSB workloadb), and that
is exactly what we are solving, and the gap is closing with a clear plan.
I can fold the per-benchmark rationale sentences and CLRU baselines
into a refreshed cover letter (no code changes), or should we just
add a link to this email instead? The existing cover letter is already
long and sufficiently supportive IMO, and the new test results match
what we already have.
In any case, I think we are headed in the right direction.
Link: https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@xxxxxxxxxxx/ [1]
Link: https://lore.kernel.org/linux-mm/CAMgjq7BsY1tJeOZwSppxUN7Lha-_a7WLfhv1_bxTuU4EuiQyVg@xxxxxxxxxxxxxx/ [2]