Re: [PATCH v3 14/14] mm/vmscan: unify writeback reclaim statistic and throttling

From: Kairui Song

Date: Sat Apr 04 2026 - 14:37:14 EST


On Sat, Apr 4, 2026 at 5:16 AM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
>
> On Thu, Apr 2, 2026 at 11:53 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@xxxxxxxxxx> wrote:
> >
> > From: Kairui Song <kasong@xxxxxxxxxxx>
> >
> > Currently, MGLRU and non-MGLRU handle reclaim statistics and
> > writeback very differently, especially throttling: MGLRU
> > basically just ignores the throttling part.
> >
> > Let's unify this part, using a helper to deduplicate the code
> > so both setups share the same behavior.
> >
> > Test using the following bash reproducer:
> >
> > echo "Setup a slow device using dm delay"
> > dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> > LOOP=$(losetup --show -f /var/tmp/backing)
> > mkfs.ext4 -q $LOOP
> > echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> > dmsetup create slow_dev
> > mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
> >
> > echo "Start writeback pressure"
> > sync && echo 3 > /proc/sys/vm/drop_caches
> > mkdir /sys/fs/cgroup/test_wb
> > echo 128M > /sys/fs/cgroup/test_wb/memory.max
> > (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> > dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
> >
> > echo "Clean up"
> > echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> > dmsetup resume slow_dev
> > umount -l /mnt/slow && sync
> > dmsetup remove slow_dev
> >
> > Before this commit, `dd` gets OOM killed immediately if
> > MGLRU is enabled. Classic LRU is fine.
> >
> > After this commit, throttling is effective: no more spinning on the
> > LRU or premature OOM. Stress tests on other workloads also look good.
> >
> > Global throttling is not here yet; we will fix that separately later.
>
> If I understand correctly, I think this fixes this regression report
> [1] from a long time ago that was never fully resolved?
>
> [1]: https://lore.kernel.org/lkml/ZeC-u7GRSptoVqia@xxxxxxxxxxxxxx/
>
> We investigated at that time, but I don't feel we got to a consensus
> on how to solve it. I think we got a bit bogged down trying to
> "completely solve writeback throttling" rather than just doing some
> incremental improvement which fixed that particular case.
>

Hello Axel!

Yes, we also observed that problem. I had almost forgotten about that
report, thanks for the link! No worries; for the majority of users, I
think the problem was already fixed about a year ago.

I previously asked Jingxiang to help fix that by waking up the
writeback flusher. In that discussion, the data showed that the
flusher was not being woken at all, and Yafang reported that reverting
14aa8b2d5c2e fixed it. So Jingxiang's fix seemed to work well at that
time:
https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@xxxxxxxxx/

AFAIK there have been no more reports of premature OOM on the mailing
list since then, but we later found that that fix isn't enough for
some particular and rare setups (for example, I used dm-delay in the
test script above to simulate slow IO). Usually reclaim can keep up:
it's rare for the LRU to be full of writeback folios, there are always
clean folios to drop, and waking up the flusher is good enough. But
under extreme pressure or with very slow devices, the LRU can get
congested with writeback folios. It's hard to apply reasonable
throttling or improve the dirty flushing without a bit more
refactoring first, and that's not the only cgroup OOM problem we
encountered.
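Not part of the patch, but for anyone reproducing this: one rough way
to see the congestion build up while the `dd` above runs is to poll
the writeback counters. A sketch (the `get_stat`/`watch_writeback`
helper names are made up; the field names are the usual
`/proc/vmstat` ones):

```shell
#!/bin/sh
# Hypothetical helper: pull one counter out of "key value" output,
# such as /proc/vmstat or a cgroup's memory.stat.
get_stat() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Example polling loop, to run alongside the dd in the reproducer.
# If the LRU is congested, nr_writeback stays high while reclaim scans.
watch_writeback() {
    for _ in 1 2 3; do
        wb=$(get_stat nr_writeback < /proc/vmstat)
        dirty=$(get_stat nr_dirty < /proc/vmstat)
        echo "writeback=$wb dirty=$dirty"
        sleep 1
    done
}
```

The same helper works on `/sys/fs/cgroup/test_wb/memory.stat` to watch
the pressure inside the test cgroup specifically.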

With this series, I think the known problems mentioned above are all
covered in a clean way.

Global pressure and throttling are still not handled yet; that's an
even rarer problem, since the LRU getting congested with writeback
globally already seems like a really bad situation to me. That can
also be fixed separately later.