Re: [PATCH v5 14/14] mm/vmscan: unify writeback reclaim statistic and throttling

From: Kairui Song

Date: Sat Apr 18 2026 - 12:58:20 EST

On Mon, Apr 13, 2026 at 12:53 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@xxxxxxxxxx> wrote:
>
> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> Currently MGLRU and non-MGLRU handle the reclaim statistic and
> writeback handling very differently, especially throttling.
> Basically MGLRU just ignored the throttling part.
>
> Let's just unify this part, use a helper to deduplicate the code
> so both setups will share the same behavior.
>
> Test using following reproducer using bash:
>
> echo "Setup a slow device using dm delay"
> dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> LOOP=$(losetup --show -f /var/tmp/backing)
> mkfs.ext4 -q $LOOP
> echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> dmsetup create slow_dev
> mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
>
> echo "Start writeback pressure"
> sync && echo 3 > /proc/sys/vm/drop_caches
> mkdir /sys/fs/cgroup/test_wb
> echo 128M > /sys/fs/cgroup/test_wb/memory.max
> (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
>
> echo "Clean up"
> echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> dmsetup resume slow_dev
> umount -l /mnt/slow && sync
> dmsetup remove slow_dev
>
> Before this commit, `dd` will get OOM killed immediately if
> MGLRU is enabled. Classic LRU is fine.
>
> After this commit, throttling is now effective and no more spin on
> LRU or premature OOM. Stress test on other workloads also looking good.
>
> Global throttling is not here yet, we will fix that separately later.
>
> Suggested-by: Chen Ridong <chenridong@xxxxxxxxxxxxxxx>
> Tested-by: Leno Hou <lenohou@xxxxxxxxx>
> Reviewed-by: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> Reviewed-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> ---
> mm/vmscan.c | 90 ++++++++++++++++++++++++++++---------------------------------
> 1 file changed, 41 insertions(+), 49 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a431f94ff3a3..43a3cadbb586 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
> return !(current->flags & PF_LOCAL_THROTTLE);
> }
>
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> + struct pglist_data *pgdat,
> + struct scan_control *sc,
> + struct reclaim_stat *stat)
> +{
> + /*
> + * If dirty folios are scanned that are not queued for IO, it
> + * implies that flushers are not doing their job. This can
> + * happen when memory pressure pushes dirty folios to the end of
> + * the LRU before the dirty limits are breached and the dirty
> + * data has expired. It can also happen when the proportion of
> + * dirty folios grows not through writes but through memory
> + * pressure reclaiming all the clean cache. And in some cases,
> + * the flushers simply cannot keep up with the allocation
> + * rate. Nudge the flusher threads in case they are asleep.
> + */
> + if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {

While doing self review, I noticed a small problem here: the function
should return without updating the counters below if nr_taken == 0.
Currently the check only skips the flusher wakeup.

We might see nr_taken == 0 because MGLRU has retry logic: if
shrink_folio_list returned some folios for being dirty or under
writeback, and they became clean while still isolated, then MGLRU
will call shrink_folio_list again without doing another isolation.

This patch is still fine with the retry in most cases. But suppose a
folio was returned by shrink_folio_list for being dirty, then suddenly
became clean and triggered the retry, then became dirty again. The
counters below might then be skewed, since a dirty folio is counted
twice. Still, this is not a big issue, and I couldn't find a way to
reproduce it even on purpose, since that requires several really short
time windows to line up, and the result is also hardly observable.
But for 100% accuracy, I'll update this patch with:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 71b4ef0e6735..af14efbc0cd8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1958,7 +1958,7 @@ static void handle_reclaim_writeback(unsigned long nr_taken,
* the flushers simply cannot keep up with the allocation
* rate. Nudge the flusher threads in case they are asleep.
*/
- if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
+ if (stat->nr_unqueued_dirty == nr_taken) {
wakeup_flusher_threads(WB_REASON_VMSCAN);
/*
* For cgroupv1 dirty throttling is achieved by waking up
@@ -4830,7 +4830,9 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
sc->nr_reclaimed += reclaimed;
- handle_reclaim_writeback(isolated, pgdat, sc, &stat);
+ /* Retry pass is only meant for clean folios without new isolation */
+ if (isolated)
+ handle_reclaim_writeback(isolated, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);

Then it should be perfect.

It might be better to remove that retry logic completely later. It's
meant to keep folio_rotate_reclaimable from missing isolated folios,
which should be done in a cleaner way. The current retry loop may also
lead to inaccurate tracepoint data, but that is not a new or major
problem, so I'm not touching that part for now.