Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)

From: Alireza Haghdoost

Date: Thu May 14 2026 - 00:10:54 EST


On Wed 06-05-26 16:21:00, Jan Kara wrote:
> Things like motivation actually belong to the changelog itself, measured
> results how the patch helps as well. On the other hand stuff like history
> is largely irrelevant here, frankly I don't have a bandwidth to carefully
> read the huge amount of text LLM has generated below so please try to make
> it more concise for next time.

Understood. Will trim for the non-RFC posting; apologies for the
volume.

> ... I quite don't see how a multisecond stalls you are describing would
> happen [...] If we are below freerun in the memcg, the task dirtying
> folios from that memcg shouldn't be throttled at all, once we get above
> freerun we throttle by maximum of throttling delay decided from global
> and memcg situation [...]

The stall is reachable even with the victim's memcg well below its
own freerun. The freerun shortcut in balance_dirty_pages() requires
both domains to be in freerun (AND, not OR):

	if (gdtc->freerun && (!mdtc || mdtc->freerun))
		goto free_running;

Once gdtc is over freerun (because the noisy neighbour pushed it
there) the shortcut does not fire, even when mdtc->freerun is true.
After the shortcut fails, the per-task pause is computed from the
dtc with the smaller pos_ratio:

	if (mdtc->pos_ratio < gdtc->pos_ratio)
		sdtc = mdtc;

When global is the worse domain, the victim sleeps against global
state, not memcg state.
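
To make the two decisions above concrete, here is a minimal userspace
model in C. It is only a sketch: struct dtc_model and both helper names
are made up for illustration, only the two fields discussed here are
modelled, and any pos_ratio values passed in are arbitrary, not measured.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for struct dirty_throttle_control; only the
 * two fields discussed in this mail are modelled. */
struct dtc_model {
	bool freerun;
	long pos_ratio;
};

/* Mirrors the quoted shortcut: the global domain AND the memcg domain
 * (if there is one) must both be in freerun. */
static bool takes_freerun_shortcut(const struct dtc_model *gdtc,
				   const struct dtc_model *mdtc)
{
	return gdtc->freerun && (!mdtc || mdtc->freerun);
}

/* Mirrors the quoted selection: the pause is computed against the
 * domain with the smaller pos_ratio, defaulting to global. */
static const struct dtc_model *pick_sdtc(const struct dtc_model *gdtc,
					 const struct dtc_model *mdtc)
{
	if (mdtc && mdtc->pos_ratio < gdtc->pos_ratio)
		return mdtc;
	return gdtc;
}
```

With gdtc = { .freerun = false, .pos_ratio = 100 } and
mdtc = { .freerun = true, .pos_ratio = 1024 }, takes_freerun_shortcut()
returns false and pick_sdtc() returns gdtc, which is the victim's
situation in the reproducer.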

> So can you perhaps share more details about the configuration where
> you observe these delays to innocent tasks due to another task
> dirtying a lot of memory? How many page cache in total and dirty
> pages are there in each memcg [...]? Is the delayed task really
> throttled in balance_dirty_pages()?

Yes. Re-ran the reproducer: stock 7.0-rc5, ext4 on virtio-blk
throttled to 256 KB/s, dirty_bytes=32M, dirty_background_bytes=16M
(freerun = 24 MB), noisy = single fio job doing unlimited buffered
randwrite, victim = single fio job doing 4 KiB sequential write
rate-limited to 500 KB/s.

Per-memcg snapshot during the contended phase, ~10 s into the run:

                    noisy memcg   victim memcg   global
  memory.current    47 MB         21 MB          --
  file (cache)      38 MB         14 MB          --
  file_dirty        26 MB         1.7 MB         27 MB
  file_writeback    1.5 MB        4.0 MB         5.3 MB

Victim memcg holds 1.7 MB dirty, far below any reasonable per-memcg
freerun. Global dirty (NR_FILE_DIRTY + NR_WRITEBACK ~ 32 MB) is over
the 24 MB freerun ceiling, driven entirely by noisy.
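
For reference, the 24 MB figure is just the midpoint of the two
thresholds, which is what dirty_freerun_ceiling() in
mm/page-writeback.c computes. A small sketch of the arithmetic, with
values in KiB; the helper over_freerun() is hypothetical and the
conversion of the snapshot figures to whole KiB is mine:

```c
#include <stdbool.h>

/* Midpoint of the dirty and background thresholds, the same formula
 * as dirty_freerun_ceiling() in mm/page-writeback.c. */
static unsigned long freerun_ceiling(unsigned long thresh,
				     unsigned long bg_thresh)
{
	return (thresh + bg_thresh) / 2;
}

/* Hypothetical helper: a domain is out of freerun once its dirty
 * count exceeds the ceiling. */
static bool over_freerun(unsigned long dirty, unsigned long thresh,
			 unsigned long bg_thresh)
{
	return dirty > freerun_ceiling(thresh, bg_thresh);
}
```

In KiB, freerun_ceiling(32768, 16384) is 24576 (24 MB), and
over_freerun(33075, 32768, 16384) is true for the ~32 MB global
figure (27 MB dirty + 5.3 MB writeback).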

The victim writer (fio with psync) is in fact sleeping in
balance_dirty_pages(). One stack snapshot during a stall:

[<0>] balance_dirty_pages+0x5c5/0xac0
[<0>] balance_dirty_pages_ratelimited_flags+0x2a1/0x380
[<0>] generic_perform_write+0x194/0x280
[<0>] ext4_buffered_write_iter+0x63/0x110
[<0>] vfs_write+0x28d/0x450
[<0>] __x64_sys_pwrite64+0x8c/0xc0
[<0>] do_syscall_64+0xfa/0x520
[<0>] entry_SYSCALL_64_after_hwframe+0x77/0x7f

Sampling /proc/<pid>/wchan at 100 Hz across the contended phase
yields the histogram:

    104  balance_dirty_pages
     88  hrtimer_nanosleep   (the fio rate-limit sleep between writes)
     12  RUNNING
      4  p9_client_rpc       (virtfs, host-guest filesystem RPC)
      3  d_alloc_parallel

Excluding the rate-limit sleep, 104 of the remaining 123 samples (~85%)
have the writer parked in balance_dirty_pages(). Victim per-IO clat in
this run reaches a 3 s tail (worst single 4 KiB pwrite blocked ~3.0 s)
while its own memcg holds < 2 MB dirty.

I'm happy to share the full traces and the reproducer if useful.

Thanks for the review,
Alireza