Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)
From: Alireza Haghdoost
Date: Thu May 14 2026 - 00:10:54 EST
On Wed 06-05-26 16:21:00, Jan Kara wrote:
> Things like motivation actually belong to the changelog itself, measured
> results how the patch helps as well. On the other hand stuff like history
> is largely irrelevant here, frankly I don't have a bandwidth to carefully
> read the huge amount of text LLM has generated below so please try to make
> it more concise for next time.
Understood. Will trim for the non-RFC posting; apologies for the
volume.
> ... I quite don't see how a multisecond stalls you are describing would
> happen [...] If we are below freerun in the memcg, the task dirtying
> folios from that memcg shouldn't be throttled at all, once we get above
> freerun we throttle by maximum of throttling delay decided from global
> and memcg situation [...]
The stall is reachable even with the victim's memcg well below its
own freerun. The freerun shortcut in balance_dirty_pages() needs both
domains in freerun (an AND, not an OR):
	if (gdtc->freerun && (!mdtc || mdtc->freerun))
		goto free_running;
Once gdtc is over freerun (because the noisy neighbour pushed it
there) the shortcut does not fire, even when mdtc->freerun is true.
After the shortcut fails, the per-task pause is computed from the
dtc with the smaller pos_ratio:
	if (mdtc->pos_ratio < gdtc->pos_ratio)
		sdtc = mdtc;
When global is the worse domain, the victim sleeps against global
state, not memcg state.
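To make the two checks concrete, here is a minimal userspace model of
only the logic quoted above. It is a sketch, not kernel code: struct
dtc_model and the model_*() helpers are hypothetical names, and the
struct carries just the two fields the argument depends on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct dirty_throttle_control (assumption:
 * only these two fields matter for the argument in this mail). */
struct dtc_model {
	bool freerun;            /* domain is at or below its freerun ceiling */
	unsigned long pos_ratio; /* smaller value means harder throttling */
};

/* True when the task may skip throttling entirely. mdtc may be NULL
 * when the task has no memcg domain. */
bool model_free_running(const struct dtc_model *gdtc,
			const struct dtc_model *mdtc)
{
	return gdtc->freerun && (!mdtc || mdtc->freerun);
}

/* Domain whose state determines the pause once the shortcut fails. */
const struct dtc_model *model_pick_sdtc(const struct dtc_model *gdtc,
					const struct dtc_model *mdtc)
{
	if (mdtc && mdtc->pos_ratio < gdtc->pos_ratio)
		return mdtc;
	return gdtc;
}
```

With gdtc->freerun false and mdtc->freerun true, model_free_running()
returns false, and as long as the global pos_ratio is the smaller one
model_pick_sdtc() returns gdtc, which is the victim's situation.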
> So can you perhaps share more details about the configuration where
> you observe these delays to innocent tasks due to another task
> dirtying a lot of memory? How many page cache in total and dirty
> pages are there in each memcg [...]? Is the delayed task really
> throttled in balance_dirty_pages()?
Yes. Re-ran the reproducer: stock 7.0-rc5, ext4 on virtio-blk
throttled to 256 KB/s, dirty_bytes=32M, dirty_background_bytes=16M
(freerun = 24 MB), noisy = single fio job doing unlimited buffered
randwrite, victim = single fio job doing 4 KiB sequential write
rate-limited to 500 KB/s.
Per-memcg snapshot during the contended phase, ~10 s into the run:
                   noisy memcg   victim memcg   global
  memory.current   47 MB         21 MB          --
  file (cache)     38 MB         14 MB          --
  file_dirty       26 MB         1.7 MB         27 MB
  file_writeback   1.5 MB        4.0 MB         5.3 MB
Victim memcg holds 1.7 MB dirty, far below any reasonable per-memcg
freerun. Global dirty (NR_FILE_DIRTY + NR_WRITEBACK ~ 32 MB) is over
the 24 MB freerun ceiling, driven entirely by noisy.
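For reference, the 24 MB figure is the midpoint of the two thresholds,
the same arithmetic dirty_freerun_ceiling() does. A small sketch of the
comparison, with hypothetical helper names and values in MB as used in
this mail (the kernel works in pages):

```c
#include <stdbool.h>

/* Midpoint of the foreground and background dirty thresholds.
 * Both arguments must use the same unit. */
unsigned long freerun_ceiling(unsigned long thresh, unsigned long bg_thresh)
{
	return (thresh + bg_thresh) / 2;
}

/* A domain leaves freerun once its dirty count exceeds the ceiling. */
bool over_freerun(unsigned long dirty, unsigned long thresh,
		  unsigned long bg_thresh)
{
	return dirty > freerun_ceiling(thresh, bg_thresh);
}
```

freerun_ceiling(32, 16) is 24, so over_freerun(32, 32, 16) is true for
the ~32 MB global figure in the snapshot.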
The victim writer (fio with psync) is in fact sleeping in
balance_dirty_pages(). One stack snapshot during a stall:
[<0>] balance_dirty_pages+0x5c5/0xac0
[<0>] balance_dirty_pages_ratelimited_flags+0x2a1/0x380
[<0>] generic_perform_write+0x194/0x280
[<0>] ext4_buffered_write_iter+0x63/0x110
[<0>] vfs_write+0x28d/0x450
[<0>] __x64_sys_pwrite64+0x8c/0xc0
[<0>] do_syscall_64+0xfa/0x520
[<0>] entry_SYSCALL_64_after_hwframe+0x77/0x7f
Sampling /proc/<pid>/wchan at 100 Hz across the contended phase
yields the histogram:
    104  balance_dirty_pages
     88  hrtimer_nanosleep    (the fio rate-limit sleep between writes)
     12  RUNNING
      4  p9_client_rpc        (virtfs, host-guest filesystem RPC)
      3  d_alloc_parallel
Excluding the 88 rate-limit sleeps, 104 of the remaining 123 samples
(~85%) have the writer parked in balance_dirty_pages(). Victim per-IO
clat in this run reaches a
3 s tail (worst single 4 KiB pwrite blocked ~3.0 s) while its own
memcg holds < 2 MB dirty.
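In case someone wants to repeat the wchan sampling, below is a rough
sketch of the two pieces such a sampler needs: reading the wait channel
and tallying it. This is not the exact tool used for the numbers above;
struct wchan_hist, hist_add() and read_wchan() are hypothetical names,
and the table sizes are arbitrary.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define MAX_SYMS 64
#define SYM_LEN  64

/* Fixed-size histogram of wait-channel symbols. */
struct wchan_hist {
	char name[MAX_SYMS][SYM_LEN];
	unsigned int count[MAX_SYMS];
	int used;
};

/* Count one sample. Returns 0 on success, -1 when the table is full. */
int hist_add(struct wchan_hist *h, const char *sym)
{
	for (int i = 0; i < h->used; i++) {
		if (strcmp(h->name[i], sym) == 0) {
			h->count[i]++;
			return 0;
		}
	}
	if (h->used == MAX_SYMS)
		return -1;
	snprintf(h->name[h->used], SYM_LEN, "%s", sym);
	h->count[h->used++] = 1;
	return 0;
}

/* Read the current wait channel of pid into buf. The file reads "0"
 * when the task is not blocked in the kernel. Returns 0 on success,
 * -1 on error. */
int read_wchan(pid_t pid, char *buf, size_t len)
{
	char path[64];
	FILE *f;

	if (len == 0)
		return -1;
	snprintf(path, sizeof(path), "/proc/%d/wchan", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f))
		buf[0] = '\0';
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

Calling read_wchan() every 10 ms and passing each result to hist_add()
gives 100 Hz sampling; printing name[] and count[] at the end yields a
histogram of the same shape as the one above.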
I'm happy to share the full traces and the reproducer if useful.
Thanks for the review,
Alireza