Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)

From: Jan Kara

Date: Mon May 25 2026 - 08:20:59 EST


Thanks for the details and sorry for the delayed reply. I had two conferences
lately...

On Wed 13-05-26 21:10:10, Alireza Haghdoost wrote:
> On Wed 06-05-26 16:21:00, Jan Kara wrote:
> > Things like motivation actually belong to the changelog itself, measured
> > results how the patch helps as well. On the other hand stuff like history
> > is largely irrelevant here, frankly I don't have a bandwidth to carefully
> > read the huge amount of text LLM has generated below so please try to make
> > it more concise for next time.
>
> Understood. Will trim for the non-RFC posting; apologies for the
> volume.
>
> > ... I quite don't see how a multisecond stalls you are describing would
> > happen [...] If we are below freerun in the memcg, the task dirtying
> > folios from that memcg shouldn't be throttled at all, once we get above
> > freerun we throttle by maximum of throttling delay decided from global
> > and memcg situation [...]
>
> The stall is reachable even with the victim's memcg well below its
> own freerun. The freerun shortcut in balance_dirty_pages() is an AND,
> not OR:
>
>	if (gdtc->freerun && (!mdtc || mdtc->freerun))
>		goto free_running;

True but this is mostly a performance optimization.

> Once gdtc is over freerun (because the noisy neighbour pushed it
> there) the shortcut does not fire, even when mdtc->freerun is true.

Below in balance_dirty_pages() we also have:

	/*
	 * Calculate global domain's pos_ratio and select the
	 * global dtc by default.
	 */
	balance_wb_limits(gdtc, strictlimit);
	if (gdtc->freerun)
		goto free_running;
	sdtc = gdtc;

	if (mdtc) {
		/*
		 * If memcg domain is in effect, calculate its
		 * pos_ratio. @wb should satisfy constraints from
		 * both global and memcg domains. Choose the one
		 * w/ lower pos_ratio.
		 */
		balance_wb_limits(mdtc, strictlimit);
		if (mdtc->freerun)
			goto free_running;

which is the key logic. So unless you have strictlimit enabled (which you
didn't mention having), being under the freerun limit in your memcg is
enough to protect you from a noisy neighbor.
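
To spell out the control flow I mean, here is a minimal sketch that mirrors
only the excerpt above plus the pos_ratio comparison you quote below. It is
not kernel code: struct dtc_model, enum throttle_choice and pick_domain() are
hypothetical names made up for illustration, and how balance_wb_limits()
actually sets freerun is deliberately not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct dirty_throttle_control. */
struct dtc_model {
	bool freerun;            /* domain considers the task free-running */
	unsigned long pos_ratio; /* lower value means stronger throttling */
};

enum throttle_choice { FREE_RUNNING, THROTTLE_GLOBAL, THROTTLE_MEMCG };

/*
 * Mirror the order of checks in the excerpt: global freerun first, then
 * memcg freerun, then pick the domain with the lower pos_ratio.
 * mdtc may be NULL when no memcg domain is in effect.
 */
static enum throttle_choice pick_domain(const struct dtc_model *gdtc,
					const struct dtc_model *mdtc)
{
	if (gdtc->freerun)
		return FREE_RUNNING;

	if (mdtc) {
		if (mdtc->freerun)
			return FREE_RUNNING;
		if (mdtc->pos_ratio < gdtc->pos_ratio)
			return THROTTLE_MEMCG;
	}

	return THROTTLE_GLOBAL;
}
```

In this model a memcg that reports freerun skips throttling even when the
global domain does not, which is the case I am arguing applies to the victim.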

> After the shortcut fails, the per-task pause is computed from the
> dtc with the smaller pos_ratio:
>
>	if (mdtc->pos_ratio < gdtc->pos_ratio)
>		sdtc = mdtc;
>
> When global is the worse domain, the victim sleeps against global
> state, not memcg state.

*If* you are above freerun in your memcg (or have strictlimit enabled) then
yes, I agree.

> > So can you perhaps share more details about the configuration where
> > you observe these delays to innocent tasks due to another task
> > dirtying a lot of memory? How many page cache in total and dirty
> > pages are there in each memcg [...]? Is the delayed task really
> > throttled in balance_dirty_pages()?
>
> Yes. Re-ran the reproducer: stock 7.0-rc5, ext4 on virtio-blk
> throttled to 256 KB/s, dirty_bytes=32M, dirty_background_bytes=16M
> (freerun = 24 MB), noisy = single fio job doing unlimited buffered
> randwrite, victim = single fio job doing 4 KiB sequential write
> rate-limited to 500 KB/s.
>
> Per-memcg snapshot during the contended phase, ~10 s into the run:
>
>                    noisy memcg   victim memcg   global
>   memory.current   47 MB         21 MB          --
>   file (cache)     38 MB         14 MB          --
>   file_dirty       26 MB         1.7 MB         27 MB
>   file_writeback   1.5 MB        4.0 MB         5.3 MB
>
> Victim memcg holds 1.7 MB dirty, far below any reasonable per-memcg
> freerun. Global dirty (NR_FILE_DIRTY + NR_WRITEBACK ~ 32 MB) is over
> the 24 MB freerun ceiling, driven entirely by noisy.

Ah, OK. I think I see what's going on. How much page cache does the machine
have in total and what are the memory limits for the noisy and victim memcgs?
There is somewhat surprising behavior in domain_dirty_limits() when you
configure dirty limits in bytes: the memcg dirty limit will roughly be
dirty_bytes / global_available_memory * memcg_available (where
memcg_available is the memcg page cache size plus how much the memcg can grow
from its current size until it hits its memory limit). Since you set
dirty_bytes to 32M and your machine presumably has gigabytes of memory, it's
possible that the victim memcg's dirty limits end up really low.
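
To put rough numbers on that, here is a small sketch of the proportional
scaling described above. It only implements the approximate formula, not the
exact arithmetic in domain_dirty_limits(), and memcg_dirty_limit_estimate()
is a hypothetical helper name. The 4 GiB and 64 MiB figures in the usage note
are assumptions picked for illustration since I don't know your actual values.

```c
#include <stdint.h>

/*
 * Rough estimate of a memcg's dirty limit when the global limit is set in
 * bytes: dirty_bytes / global_avail * memcg_avail.
 *
 * All arguments are in bytes. The multiplication is done first to keep
 * integer precision; this assumes dirty_bytes * memcg_avail fits in 64 bits,
 * which holds for the sizes discussed here.
 */
static uint64_t memcg_dirty_limit_estimate(uint64_t dirty_bytes,
					   uint64_t global_avail,
					   uint64_t memcg_avail)
{
	if (global_avail == 0)
		return 0;

	return dirty_bytes * memcg_avail / global_avail;
}
```

With dirty_bytes = 32 MiB, an assumed 4 GiB of globally available memory and
an assumed 64 MiB available to the victim memcg, the estimate is
32 MiB * 64 MiB / 4 GiB = 512 KiB, i.e. well below the 1.7 MB of dirty pages
you report for the victim.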

> The victim writer (fio with psync) is in fact sleeping in
> balance_dirty_pages(). One stack snapshot during a stall:
>
> [<0>] balance_dirty_pages+0x5c5/0xac0
> [<0>] balance_dirty_pages_ratelimited_flags+0x2a1/0x380
> [<0>] generic_perform_write+0x194/0x280
> [<0>] ext4_buffered_write_iter+0x63/0x110
> [<0>] vfs_write+0x28d/0x450
> [<0>] __x64_sys_pwrite64+0x8c/0xc0
> [<0>] do_syscall_64+0xfa/0x520
> [<0>] entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> Sampling /proc/<pid>/wchan at 100 Hz across the contended phase
> yields the histogram:
>
> 104 balance_dirty_pages
> 88 hrtimer_nanosleep (the fio rate-limit sleep between writes)
> 12 RUNNING
> 4 p9_client_rpc (virtfs, host-guest filesystem RPC)
> 3 d_alloc_parallel
>
> The vast majority of non-rate-limit samples have the writer parked
> in balance_dirty_pages(). Victim per-IO clat in this run reaches a
> 3 s tail (worst single 4 KiB pwrite blocked ~3.0 s) while its own
> memcg holds < 2 MB dirty.

If dirty limits for the victim memcg end up really low, then yes, this is
what I'd expect.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR