Re: [PATCH RFC] memcg: add per-cgroup dirty page controls (dirty_ratio, dirty_min)

From: Jan Kara

Date: Wed May 06 2026 - 10:34:56 EST


On Fri 01-05-26 22:28:38, Alireza Haghdoost via B4 Relay wrote:
> From: Alireza Haghdoost <haghdoost@xxxxxxxx>
>
> Add two cgroup v2 memory-controller knobs that bring
> balance_dirty_pages() throttling into per-cgroup scope so one noisy
> writer cannot stall peers sharing the same host:
>
> memory.dirty_ratio   Per-cgroup dirty-page ceiling, 0..100 percent of
>                      the cgroup's dirtyable memory. 0 (default) leaves
>                      the cgroup subject to the global threshold only.
>
> memory.dirty_min     Guaranteed dirty-page floor, byte value (default 0).
>
> The two knobs compose: dirty_ratio bounds how much dirty memory a
> cgroup may accrue, dirty_min guarantees a floor below which it is
> never throttled.
>
> Motivation, design trade-offs, cost analysis, validation data, and
> open questions are in the cover letter.
>
> Co-developed-by: Kshitij Doshi <kshitijd@xxxxxxxx>
> Signed-off-by: Kshitij Doshi <kshitijd@xxxxxxxx>
> Signed-off-by: Alireza Haghdoost <haghdoost@xxxxxxxx>
> Assisted-by: Cursor:claude-sonnet-4.5
> ---

Things like motivation actually belong in the changelog itself, as do
measured results showing how the patch helps. On the other hand, stuff like
history is largely irrelevant here. Frankly, I don't have the bandwidth to
carefully read the huge amount of text the LLM has generated below, so
please try to make it more concise next time.

> This RFC adds two cgroup v2 memory-controller knobs that give operators
> per-cgroup control over dirty-page throttling in balance_dirty_pages():
> memory.dirty_ratio (per-cgroup ceiling) and memory.dirty_min (guaranteed
> floor). A third knob, memory.dirty_weight, is forthcoming in a follow-up
> once we have validated the application site (see "Follow-ups" below).
> We are posting this as an RFC, as a single squashed patch, to get design
> feedback before splitting the prototype into a per-logical-change series.
>
> Motivation
> ==========
>
> balance_dirty_pages() (BDP) is a global throttle. It sleeps writers
> once the host-wide dirty count crosses a single threshold. On a
> container host that threshold is shared across cgroups. A cgroup
> that dirties pages faster than storage can drain them pushes the
> count over the limit. Every writer on the host then parks in
> io_schedule_timeout() -- including cgroups that have not dirtied a
> single page of their own.
>
> cgroup v2 already has per-memcg dirty accounting, but that accounting
> does not translate into per-memcg dirty throttling.

Not quite true. We do have per-memcg writeback workers, and we do have
per-memcg dirty limits (inferred from the global dirty limit tunables) that
are enforced...

> We see this in production: a buffered write-heavy container generates
> multi-second stalls for co-located latency-sensitive workloads.
> Moreover, dirty-page accumulation from a single noisy neighbor is a
> recurring contributor to mount-responsiveness degradation on shared hosts.

... and I don't quite see how the multi-second stalls you are describing
would happen. There is something I must be missing. The throttling works as
follows: until we cross the global freerun limit (that is, (background_limit
+ dirty_limit) / 2), nobody is throttled. Once we cross it, memcg dirty
limits start to be taken into account. If we are below the freerun limit in
the memcg, a task dirtying folios from that memcg shouldn't be throttled at
all; once we get above the freerun limit, we throttle by the maximum of the
throttling delays decided from the global and memcg situations, so long
delays can then start happening -- but that would mean the "innocent"
task's memcg had to get at least over the freerun limit.
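The two-level gating sequence just described can be sketched in a few lines
(illustrative Python with made-up helper names; the real logic lives in
mm/page-writeback.c, and all values are in pages):

```python
def freerun(thresh, bg_thresh):
    """Mirrors dirty_freerun_ceiling(): midpoint of the background
    and hard dirty thresholds."""
    return (thresh + bg_thresh) // 2

def throttled(g_dirty, g_thresh, g_bg, m_dirty, m_thresh, m_bg):
    """Simplified: a writer can only be throttled when BOTH the
    global domain and its memcg domain are over their respective
    freerun ceilings."""
    if g_dirty <= freerun(g_thresh, g_bg):
        return False        # below global freerun: nobody throttled
    if m_dirty <= freerun(m_thresh, m_bg):
        return False        # memcg still inside its own freerun zone
    return True
```

With, say, global thresholds of 8192/4096 pages the global freerun ceiling
is 6144 pages; a memcg sitting below its own freerun ceiling free-runs even
when the global count is above 6144.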

So can you perhaps share more details about the configuration where you
observe these delays to innocent tasks due to another task dirtying a lot
of memory? How much page cache in total, and how many dirty pages, are
there in each memcg (both for the aggressive dirtier and the wrongly
delayed task)? Is the delayed task really throttled in
balance_dirty_pages()?

Honza


> Prior work
> ==========
>
> Per-memcg dirty-page limits have been proposed before. Andrea Righi
> posted an initial RFC in February 2010 [1]; Greg Thelen continued the
> work through v9 in August 2011 [2]. That series added per-memcg dirty
> counters and hooked them into balance_dirty_pages(), but it bolted
> per-cgroup limits onto the global writeback path without making
> writeback itself cgroup-aware. Without cgroup-aware flusher threads,
> a cgroup exceeding its limit triggered writeback of inodes from any
> cgroup, giving poor isolation. The series was not merged.
>
> Konstantin Khlebnikov posted "[PATCHSET RFC 0/6] memcg: inode-based
> dirty-set controller" in January 2015 [4], which proposed
> memory.dirty_ratio (the same interface name this series uses) via an
> inode-tagged, filtered-writeback approach. Tejun Heo reviewed it
> and rejected it as a "dead end" that duplicated lower-layer policy
> without solving the underlying isolation problem; this rejection
> directly motivated Tejun's native cgwb rework described below.
>
> Tejun Heo's 48-patch cgroup-writeback rework, merged in Linux 4.2
> (commit e4bc13adfd01, "Merge branch 'for-4.2/writeback'"), took the
> different approach of restructuring writeback to be natively
> cgroup-aware: per-memcg wb_domain (commit 841710aa6e4a), per-memcg
> NR_FILE_DIRTY / NR_WRITEBACK accounting, and cgroup-aware flusher
> threads [3]. That work deliberately deferred user-facing policy knobs.
> This series adds the policy surface that consumes Tejun's
> infrastructure. The dirty_min reservation concept is, to our
> knowledge, new.
>
> A November 2023 LKML thread by Chengming Zhou [5] independently
> identified the identical throttling regression on cgroup v2 (a 5 GB
> container constantly throttled because memory.max * dirty_ratio yields
> too small a threshold for bursty workloads). Jan Kara participated and
> endorsed a bytes-based per-memcg dirty limit; no patches followed that
> discussion, confirming the gap this series fills.
>
> [1] https://lwn.net/Articles/408349/
> [2] https://lore.kernel.org/lkml/1313597705-6093-1-git-send-email-gthelen@xxxxxxxxxx/
> [3] https://lwn.net/Articles/648292/
> [4] https://lore.kernel.org/all/20150115180242.10450.92.stgit@buzz/
> [5] https://lore.kernel.org/all/109029e0-1772-4102-a2a8-ab9076462454@xxxxxxxxx/
>
> Proposed interface
> ==================
>
> Two new cgroup v2 files under the memory controller, absent on the root
> cgroup (CFTYPE_NOT_ON_ROOT):
>
> memory.dirty_ratio
> Integer [0, 100]. Per-cgroup dirty-page ceiling as a percentage of
> the cgroup's dirtyable memory (mdtc->avail: file cache plus
> reclaimable slack), the same base the global vm_dirty_ratio scales
> against. 0 (the default) disables the per-cgroup ceiling and leaves
> the cgroup subject to the global threshold only. A non-zero value
> that is stricter than vm_dirty_ratio overrides the global ratio for
> this cgroup via min(mdtc->thresh, cg_thresh); because both sides
> scale off the same base, the knob can never widen the cgroup past
> the global ceiling. A memory.dirty_bytes companion for byte-precise
> caps (mirroring vm_dirty_bytes) is noted under "Follow-ups" below.
> The prototype reads the value for the immediate memcg only;
> hierarchical enforcement (clamping against ancestors, like
> memory.max) is not implemented yet. We would like guidance on
> whether this is required for v1 or can follow in a subsequent
> series.
>
> memory.dirty_min
> Byte value (K/M/G suffixes accepted), default 0 (no reservation).
> Guaranteed dirty-page floor: while cgroup_dirty < dirty_min,
> throttling is bypassed (goto free_running). Lets a latency-sensitive
> cgroup buffer a small write burst even under global dirty pressure.
>
> dirty_min is an admission guarantee, so we have to prevent it from
> breaking the global dirty invariant. Two aspects:
>
> - Global cap. The sum of dirty_min reservations across all cgroups
> must not exceed a fraction of the global dirty threshold (our
> working number is 80%), so the system always retains some shared
> capacity. The prototype does not enforce this cap yet; we expect
> to either reject at write() time or clamp on read in a cheap
> precomputed effective_dirty_min. We would appreciate feedback on
> which approach the cgroup maintainers prefer.
>
> - Per-cgroup cap. A cgroup should not be able to reserve more dirty
> capacity than it can hold. Our tentative rule is
> effective_dirty_min = min(dirty_min, memory.max - memory.min),
> evaluated at enforcement time so it tracks live memory.max changes,
> rather than being rejected at write() time. This is similar to how
> memory.low composes with memory.max.
>
> Neither cap is implemented in the prototype; both would land before
> a non-RFC posting.
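The two proposed (and, per the cover letter, not yet implemented) safety
caps can be sketched as follows; function names and the 80% working number
are taken from the description above, everything else is illustrative:

```python
def effective_dirty_min(dirty_min, memory_max, memory_min):
    """Per-cgroup cap: a cgroup cannot reserve more dirty capacity
    than it can hold -- min(dirty_min, memory.max - memory.min),
    evaluated at enforcement time."""
    return min(dirty_min, memory_max - memory_min)

def global_cap_ok(reservations, global_thresh, frac=0.8):
    """Global cap: the sum of dirty_min reservations across all
    cgroups must stay under a fraction (working number: 80%) of the
    global dirty threshold, preserving shared capacity."""
    return sum(reservations) <= int(global_thresh * frac)
```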
>
> The two knobs compose: dirty_ratio bounds how much dirty memory a
> cgroup may accrue, dirty_min guarantees a floor below which it is never
> throttled.
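The composition of the two knobs reduces to a min() on the threshold plus a
bypass check; a minimal sketch under the semantics stated above (names are
illustrative, values in pages):

```python
def effective_thresh(global_thresh, avail, dirty_ratio):
    """dirty_ratio scales the cgroup's own dirtyable memory (avail),
    the same base as vm_dirty_ratio; 0 means 'global threshold only'.
    min() ensures the knob can only tighten the ceiling, never widen
    it past the global one."""
    if dirty_ratio == 0:
        return global_thresh
    return min(global_thresh, avail * dirty_ratio // 100)

def bypass_throttle(cg_dirty, dirty_min):
    """While the cgroup's dirty+writeback backlog is below its
    dirty_min reservation, the writer free-runs regardless of
    global dirty pressure."""
    return dirty_min > 0 and cg_dirty < dirty_min
```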
>
> Test setup and results
> ======================
>
> To show the problem and the fix, we built a single reproducer that runs
> on an unmodified stock kernel and then on the patched kernel, using the
> same setup for both.
>
> Setup: QEMU guest with a virtio-blk disk throttled to 256 KB/s
> (bps_wr=262144). Two sibling cgroups, no io.weight; both share disk
> bandwidth equally. dirty_bytes=32MB, dirty_background_bytes=16MB
> (freerun = 24 MB). Files pre-allocated with fallocate before dirty
> pressure. Two phases per run: (1) victim alone (baseline), (2) noisy
> fills global dirty to the 32 MB cap, then victim runs contended for
> 30 s.
>
> - noisy: single fio job, unlimited write rate, fills global dirty pool.
> - victim: single fio job, rate-limited to 500 KB/s (128 IOPS target),
> 4 KiB sequential write().
>
> The high freerun (24 MB) ensures victim's solo dirty accumulation
> (244 KB/s x 30 s = 7.3 MB) stays below the threshold. BDP does not fire
> during the solo phase on either kernel, giving identical baselines.
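The setup arithmetic checks out; a quick sanity sketch (bytes, decimal-free
integer math):

```python
MB = 1024 * 1024
KB = 1024

# freerun ceiling = (dirty_background_bytes + dirty_bytes) / 2
freerun = (16 * MB + 32 * MB) // 2     # 24 MB

# victim's solo-phase dirty accumulation at ~244 KB/s for 30 s
solo_accum = 244 * KB * 30             # ~7.3 MB, well under freerun
```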
>
> Stock kernel results (the problem):
>
>                solo (no contention)  contended   inflation
> victim IOPS    125                   5.1         24.5x worse
> victim p99     0.6 ms                152 ms      253x worse
>
> The contended p99 sits at fio's percentile bucket nearest MAX_PAUSE =
> HZ/5 = 200 ms (mm/page-writeback.c:49), the hard kernel ceiling on BDP
> sleep. The victim has near-zero dirty pages of its own but is forced to
> sleep because balance_dirty_pages() sees gdtc->dirty = NR_FILE_DIRTY +
> NR_WRITEBACK above the freerun threshold. Most of noisy's pages are
> queued in NR_WRITEBACK waiting for the throttled disk.
> memory.events.local on both cgroups shows max 0 throughout the run;
> this is not memory pressure inside either cgroup.
>
> Patched kernel results (the fix), with victim/memory.dirty_min = 16 MB:
>
>                solo (no contention)  contended   inflation
> victim IOPS    125                   125         1.0x
> victim p99     0.9 ms                0.7 ms      1.0x
>
> The patched kernel checks cgroup_dirty < dirty_min (4096 pages) before
> computing any sleep. Because the rate-limited victim's resident dirty
> set stays well below the reservation, the check fires on every write()
> -> goto free_running -> write() returns in ~7 us. The victim is fully
> protected.
>
> The figures above are the per-kernel medians of N=5 iterations and
> reflect a deterministic outcome on every iteration: stock had cont_iops
> in 4.4..6.1 (retention 0.035..0.049) on all five iterations, patched had
> cont_iops = 125.0 (retention = 1.000) on all five. All five stock
> iterations hit BDP's throttled regime; all five patched iterations
> bypassed it via the dirty_min check.
>
> Implementation
> ==============
>
> The patch touches five files:
>
> - include/linux/memcontrol.h: two new fields on struct mem_cgroup
> (dirty_ratio, dirty_min).
> - include/linux/writeback.h: a per-pass cg_dirty_cap field on
> struct dirty_throttle_control used to publish the memcg clamp to
> BDP's setpoint and rate-limit math.
> - include/trace/events/writeback.h: cg_dirty_cap added to the
> balance_dirty_pages tracepoint so operators can distinguish the
> memcg clamp from the global dirty_limit at runtime.
> - mm/memcontrol.c: registers the two cgroup v2 files with input
> validation.
> - mm/page-writeback.c: the throttling changes.
>
> The key changes in page-writeback.c:
>
> - The dirty_ratio clamp lives inside domain_dirty_limits(), keyed
> on wb->memcg_css (the inode owner's memcg). Every consumer of
> the memcg dtc -- writer throttle, flusher kworker,
> cgwb_calc_thresh -- sees the same clamped thresh and bg_thresh.
> The clamp uses mult_frac() so a small memcg does not collapse
> to a zero ceiling.
>
> - The dirty_min bypass lives in balance_dirty_pages() and is
> writer-keyed: the writing task's memcg is looked up under RCU,
> and when its dirty+in-flight backlog is below dirty_min the
> loop jumps to free_running, bypassing both the global and the
> per-memcg BDP gates. dirty_min is an admission guarantee for
> the writer's own cgroup, not for inode owners.
>
> - When the clamp engages, mdtc->dirty is replaced with the
> memcg-wide NR_FILE_DIRTY + NR_WRITEBACK sum so freerun /
> setpoint / rate-limit smoothing see the real backlog and pages
> migrating from NR_FILE_DIRTY into NR_WRITEBACK cannot silently
> widen the cap.
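The mult_frac() point is worth making concrete: with plain integer division
the naive order of operations collapses to zero for memcgs smaller than 100
pages, whereas multiplying first preserves precision. A Python sketch of
the C arithmetic (the real mult_frac() also splits quotient and remainder
to avoid intermediate overflow, omitted here):

```python
def naive_thresh(avail, ratio):
    """(avail / 100) * ratio in C integer math: truncates first."""
    return (avail // 100) * ratio

def mult_frac(x, num, denom):
    """C's mult_frac(x, num, denom) semantics: x * num / denom,
    multiplying before dividing (overflow handling omitted)."""
    return x * num // denom
```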
>
> Fast-path cost when neither knob is set: one rcu_read_lock/unlock
> pair plus a READ_ONCE(dirty_min) per balance_dirty_pages()
> iteration, and one READ_ONCE(dirty_ratio) per domain_dirty_limits()
> call. The memcg counter reads are gated on "knob armed" and do
> not fire on the default path. We have not measured the added
> cost yet, but we expect it to be in the noise of existing BDP
> bookkeeping. A tight pwrite() microbenchmark will confirm this
> before a non-RFC posting.
>
> Scope
> =====
>
> This series affects the writer-side throttle (balance_dirty_pages())
> only. It does not partition the flusher-side writeback queue. A
> cgroup's fsync() can still block behind inodes from other cgroups in
> writeback_sb_inodes(). We document this limit explicitly and expect
> writeback-queue partitioning to be a separate, larger effort.
>
> Interaction with block-layer throttles
> ======================================
>
> The two knobs are orthogonal to io.max / io.cost. balance_dirty_pages()
> runs before the bio reaches the block layer, so dirty_min simply allows
> a cgroup to keep accepting write() syscalls up to its reservation; the
> actual I/O is still subject to whatever block throttle is in effect. In
> the reproducer above, the disk-level bandwidth limit (256 KB/s) is
> applied at the QEMU virtio-blk layer, and the protected victim dirties
> roughly equal to the rate it can drain, so the block throttle is
> exercised on both kernels. We have not yet tested interaction with
> guest-side io.max settings; this is on the list before a non-RFC
> posting.
>
> Questions for maintainers
> =========================
>
> 1. Is writer-throttle-only scope (no flusher/writeback-queue work)
> acceptable for the first series?
> 2. Does dirty_ratio belong on struct mem_cgroup (as the prototype
> has it) or on the memcg's wb_domain? Routing it through wb_domain
> would let us reuse __wb_calc_thresh() and keep all threshold
> policy in one place; a possible split is dirty_ratio on wb_domain
> (it is threshold policy), dirty_min on the memcg (throttle
> bypass). We can go either way.
> 3. For dirty_min safety caps (per-cgroup and global sum), which
> approach do you prefer: reject at write time, or clamp on read in
> an effective_dirty_min?
> 4. Is hierarchical enforcement of dirty_ratio (clamp against
> ancestors) required for v1, or can it follow in a subsequent
> series?
>
> What's missing before a non-RFC posting
> =======================================
>
> - Split the monolithic prototype into a proper series (one patch per
> concept + Documentation + selftest).
> - Documentation/admin-guide/cgroup-v2.rst entries for both knobs.
> - tools/testing/selftests/cgroup/ test for interface surface and
> noisy-neighbor protection.
> - Implement the per-cgroup and global dirty_min safety caps described
> in the memory.dirty_min bullet.
> - Fast-path microbenchmark: confirm zero measurable regression for
> cgroups that have neither knob set.
> - Larger-N validation on real hardware (the current N=5 data is from
> a QEMU guest on a throttled virtio-blk).
>
> Follow-ups (out of scope for this series)
> =========================================
>
> - memory.dirty_weight: a priority weight knob that scales the BDP
> pause length, planned as a separate series. The prototype validated
> the interface surface but the application site (post-pause scaling
> vs. folding into pos_ratio / dirty_ratelimit) needs to be settled
> before we ship it. Happy to discuss in advance of that posting.
> - memory.dirty_bytes: a byte-value companion to memory.dirty_ratio,
> mirroring the global vm_dirty_bytes. For operators who want a
> byte-predictable per-cgroup dirty cap rather than a ratio of the
> cgroup's dirtyable memory. We have not prototyped this yet; we
> are listing it so reviewers know it is on the roadmap, since
> the ratio-only interface omits that use case.
> - Writeback-queue partitioning: flusher-side fairness across
> cgroups, as noted in Scope above.
>
> Looking forward to feedback.
>
> Thanks,
> Alireza Haghdoost <haghdoost@xxxxxxxx>
> Kshitij Doshi <kshitijd@xxxxxxxx>
> ---
> include/linux/memcontrol.h | 10 +++
> include/linux/writeback.h | 4 +
> include/trace/events/writeback.h | 5 +-
> mm/memcontrol.c | 62 ++++++++++++++
> mm/page-writeback.c | 179 ++++++++++++++++++++++++++++++++++++---
> 5 files changed, 249 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index dc3fa687759b..45ca949a4c68 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -323,6 +323,16 @@ struct mem_cgroup {
> spinlock_t event_list_lock;
> #endif /* CONFIG_MEMCG_V1 */
>
> + /* Per-memcg dirty-page controls (memory.dirty_ratio, memory.dirty_min) */
> + /*
> + * dirty_ratio: [0, 100] percent of dirtyable memory (mdtc->avail),
> + * matching the global vm_dirty_ratio base; 0 inherits the global
> + * threshold.
> + * dirty_min: dirty-page reservation, in pages; 0 disables the bypass.
> + */
> + unsigned int dirty_ratio;
> + unsigned long dirty_min;
> +
> struct mem_cgroup_per_node *nodeinfo[];
> };
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 62552a2ce5b9..e37632f728be 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -318,6 +318,10 @@ struct dirty_throttle_control {
> unsigned long thresh; /* dirty threshold */
> unsigned long bg_thresh; /* dirty background threshold */
> unsigned long limit; /* hard dirty limit */
> + unsigned long cg_dirty_cap; /* per-memcg dirty_ratio clamp for
> + * this pass, or PAGE_COUNTER_MAX
> + * when no memcg clamp applies
> + */
>
> unsigned long wb_dirty; /* per-wb counterparts */
> unsigned long wb_thresh;
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index bdac0d685a98..0bf86b3c903c 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -672,6 +672,7 @@ TRACE_EVENT(balance_dirty_pages,
> __array( char, bdi, 32)
> __field(u64, cgroup_ino)
> __field(unsigned long, limit)
> + __field(unsigned long, cg_dirty_cap)
> __field(unsigned long, setpoint)
> __field(unsigned long, dirty)
> __field(unsigned long, wb_setpoint)
> @@ -691,6 +692,7 @@ TRACE_EVENT(balance_dirty_pages,
> strscpy_pad(__entry->bdi, bdi_dev_name(wb->bdi), 32);
>
> __entry->limit = dtc->limit;
> + __entry->cg_dirty_cap = dtc->cg_dirty_cap;
> __entry->setpoint = (dtc->limit + freerun) / 2;
> __entry->dirty = dtc->dirty;
> __entry->wb_setpoint = __entry->setpoint *
> @@ -710,13 +712,14 @@ TRACE_EVENT(balance_dirty_pages,
>
>
> TP_printk("bdi %s: "
> - "limit=%lu setpoint=%lu dirty=%lu "
> + "limit=%lu cg_dirty_cap=%lu setpoint=%lu dirty=%lu "
> "wb_setpoint=%lu wb_dirty=%lu "
> "dirty_ratelimit=%lu task_ratelimit=%lu "
> "dirtied=%u dirtied_pause=%u "
> "paused=%lu pause=%ld period=%lu think=%ld cgroup_ino=%llu",
> __entry->bdi,
> __entry->limit,
> + __entry->cg_dirty_cap,
> __entry->setpoint,
> __entry->dirty,
> __entry->wb_setpoint,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c3d98ab41f1f..c43fe4f394eb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4748,6 +4748,56 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static int memory_dirty_ratio_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> + seq_printf(m, "%u\n", READ_ONCE(memcg->dirty_ratio));
> + return 0;
> +}
> +
> +static ssize_t memory_dirty_ratio_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned int ratio;
> + int err;
> +
> + err = kstrtouint(strstrip(buf), 0, &ratio);
> + if (err)
> + return err;
> +
> + if (ratio > 100)
> + return -EINVAL;
> +
> + WRITE_ONCE(memcg->dirty_ratio, ratio);
> + return nbytes;
> +}
> +
> +static int memory_dirty_min_show(struct seq_file *m, void *v)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> + /* seq_puts_memcg_tunable automatically multiplies by PAGE_SIZE for the user */
> + return seq_puts_memcg_tunable(m, READ_ONCE(memcg->dirty_min));
> +}
> +
> +static ssize_t memory_dirty_min_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long dirty_min;
> + int err;
> +
> + /* page_counter_memparse converts strings like "512M" into a page count */
> + err = page_counter_memparse(strstrip(buf), "max", &dirty_min);
> + if (err)
> + return err;
> +
> + WRITE_ONCE(memcg->dirty_min, dirty_min);
> + return nbytes;
> +}
> +
> /*
> * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
> * if any new events become available.
> @@ -4950,6 +5000,18 @@ static struct cftype memory_files[] = {
> .flags = CFTYPE_NS_DELEGATABLE,
> .write = memory_reclaim,
> },
> + {
> + .name = "dirty_ratio",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_dirty_ratio_show,
> + .write = memory_dirty_ratio_write,
> + },
> + {
> + .name = "dirty_min",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_dirty_min_show,
> + .write = memory_dirty_min_write,
> + },
> { } /* terminate */
> };
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 88cd53d4ba09..2847b2c1e59a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -124,14 +124,17 @@ struct wb_domain global_wb_domain;
>
> #define GDTC_INIT(__wb) .wb = (__wb), \
> .dom = &global_wb_domain, \
> - .wb_completions = &(__wb)->completions
> + .wb_completions = &(__wb)->completions, \
> + .cg_dirty_cap = PAGE_COUNTER_MAX
>
> -#define GDTC_INIT_NO_WB .dom = &global_wb_domain
> +#define GDTC_INIT_NO_WB .dom = &global_wb_domain, \
> + .cg_dirty_cap = PAGE_COUNTER_MAX
>
> #define MDTC_INIT(__wb, __gdtc) .wb = (__wb), \
> .dom = mem_cgroup_wb_domain(__wb), \
> .wb_completions = &(__wb)->memcg_completions, \
> - .gdtc = __gdtc
> + .gdtc = __gdtc, \
> + .cg_dirty_cap = PAGE_COUNTER_MAX
>
> static bool mdtc_valid(struct dirty_throttle_control *dtc)
> {
> @@ -183,8 +186,9 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
> #else /* CONFIG_CGROUP_WRITEBACK */
>
> #define GDTC_INIT(__wb) .wb = (__wb), \
> - .wb_completions = &(__wb)->completions
> -#define GDTC_INIT_NO_WB
> + .wb_completions = &(__wb)->completions, \
> + .cg_dirty_cap = PAGE_COUNTER_MAX
> +#define GDTC_INIT_NO_WB .cg_dirty_cap = PAGE_COUNTER_MAX
> #define MDTC_INIT(__wb, __gdtc)
>
> static bool mdtc_valid(struct dirty_throttle_control *dtc)
> @@ -392,6 +396,58 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
> bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
> thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
> }
> +
> + /*
> + * Apply the per-memcg dirty_ratio clamp on mdtc (gdtc != NULL
> + * iff @dtc is a memcg dtc). dirty_ratio is scaled against
> + * the memcg's own dirtyable memory (@available_memory), matching
> + * the semantics of vm_dirty_ratio so the two knobs share a base
> + * and compose via a plain min() on thresh. The clamp is keyed
> + * on wb->memcg_css (the inode-owner's memcg) rather than on
> + * current's memcg, so balance_dirty_pages(), wb_over_bg_thresh()
> + * (flusher kworker context), and cgwb_calc_thresh() all see the
> + * same clamped value.
> + *
> + * Published on dtc->cg_dirty_cap as well so hard_dirty_limit()
> + * callers in balance_dirty_pages() can ignore the slower
> + * dom->dirty_limit smoothing when deriving setpoint/
> + * rate-limit from the clamped ceiling.
> + *
> + * Clamp is applied after the rt/dl boost: dirty_ratio is a
> + * strict override, not widened by priority. bg_thresh is
> + * scaled by the same factor we apply to thresh so the
> + * user-configured bg/thresh ratio survives clamping instead
> + * of snapping to thresh/2 via the bg_thresh >= thresh guard
> + * below. mult_frac() preserves precision for small memcgs
> + * where a plain "(avail / 100) * ratio" would collapse to 0.
> + */
> + if (gdtc) {
> + struct mem_cgroup *memcg =
> + mem_cgroup_from_css(dtc->wb->memcg_css);
> + unsigned int cg_ratio = memcg ?
> + READ_ONCE(memcg->dirty_ratio) : 0;
> +
> + /*
> + * dtc is reused across balance_dirty_pages() iterations,
> + * so reset the published clamp every call -- an admin
> + * clearing memory.dirty_ratio mid-flight must take effect
> + * on the next pass.
> + */
> + dtc->cg_dirty_cap = PAGE_COUNTER_MAX;
> +
> + if (cg_ratio) {
> + unsigned long cg_thresh = mult_frac(available_memory,
> + cg_ratio, 100);
> +
> + if (cg_thresh < thresh) {
> + bg_thresh = mult_frac(bg_thresh, cg_thresh,
> + thresh);
> + thresh = cg_thresh;
> + dtc->cg_dirty_cap = cg_thresh;
> + }
> + }
> + }
> +
> /*
> * Dirty throttling logic assumes the limits in page units fit into
> * 32-bits. This gives 16TB dirty limits max which is hopefully enough.
> @@ -1065,7 +1121,9 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
> struct bdi_writeback *wb = dtc->wb;
> unsigned long write_bw = READ_ONCE(wb->avg_write_bandwidth);
> unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
> - unsigned long limit = dtc->limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
> + unsigned long limit = dtc->limit = min(hard_dirty_limit(dtc_dom(dtc),
> + dtc->thresh),
> + dtc->cg_dirty_cap);
> unsigned long wb_thresh = dtc->wb_thresh;
> unsigned long x_intercept;
> unsigned long setpoint; /* dirty pages' target balance point */
> @@ -1334,7 +1392,8 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
> struct bdi_writeback *wb = dtc->wb;
> unsigned long dirty = dtc->dirty;
> unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
> - unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
> + unsigned long limit = min(hard_dirty_limit(dtc_dom(dtc), dtc->thresh),
> + dtc->cg_dirty_cap);
> unsigned long setpoint = (freerun + limit) / 2;
> unsigned long write_bw = wb->avg_write_bandwidth;
> unsigned long dirty_ratelimit = wb->dirty_ratelimit;
> @@ -1822,22 +1881,122 @@ static int balance_dirty_pages(struct bdi_writeback *wb,
> int ret = 0;
>
> for (;;) {
> + unsigned long cg_dirty_min = 0;
> + unsigned long cg_dirty_pages = 0;
> unsigned long now = jiffies;
>
> nr_dirty = global_node_page_state(NR_FILE_DIRTY);
>
> balance_domain_limits(gdtc, strictlimit);
> +
> + /*
> + * Under RCU, snapshot the current memcg's memory.dirty_min
> + * reservation. When it is non-zero, also snapshot the
> + * memcg-wide dirty backlog. These feed the per-writer
> + * dirty_min bypass below; the dirty_ratio clamp itself
> + * is applied inside domain_dirty_limits() keyed on
> + * wb->memcg_css so balance_dirty_pages(),
> + * wb_over_bg_thresh() (flusher kworker context), and
> + * cgwb_calc_thresh() all see a consistent clamped
> + * threshold.
> + *
> + * rcu_read_lock() is held only for the __rcu dereference
> + * of current->cgroups; the memcg pointer does not escape
> + * the critical section. The counter read matches
> + * domain_dirty_avail(mdtc, true) so the bypass compares
> + * the same dirty+in-flight backlog the global path uses.
> + */
> + rcu_read_lock();
> + {
> + struct mem_cgroup *memcg =
> + mem_cgroup_from_task(current);
> +
> + if (memcg) {
> + cg_dirty_min = READ_ONCE(memcg->dirty_min);
> + if (cg_dirty_min)
> + cg_dirty_pages =
> + memcg_page_state(memcg,
> + NR_FILE_DIRTY) +
> + memcg_page_state(memcg,
> + NR_WRITEBACK);
> + }
> + }
> + rcu_read_unlock();
> +
> if (mdtc) {
> /*
> - * If @wb belongs to !root memcg, repeat the same
> - * basic calculations for the memcg domain.
> + * For !root memcg, repeat the same three-step
> + * sequence as balance_domain_limits(gdtc):
> + * avail -> limits -> freerun. We inline it here
> + * so we can insert the mdtc->dirty override
> + * between step 2 (domain_dirty_limits, which
> + * publishes the per-memcg dirty_ratio clamp on
> + * cg_dirty_cap) and step 3 (domain_dirty_freerun,
> + * which consumes mdtc->dirty along with
> + * thresh/bg_thresh).
> + */
> + domain_dirty_avail(mdtc, true);
> + domain_dirty_limits(mdtc);
> +
> + /*
> + * When the dirty_ratio clamp engaged, replace the
> + * per-wb dirty count from mem_cgroup_wb_stats()
> + * with the memcg-wide NR_FILE_DIRTY + NR_WRITEBACK
> + * sum so freerun, the setpoint, and the rate-limit
> + * smoothing see the true memcg backlog instead of
> + * the subset that has migrated to this cgwb (cgwb
> + * migration is lazy and can lag by many seconds),
> + * and so a burst of buffered writes cannot silently
> + * bypass the clamp by shifting pages from
> + * NR_FILE_DIRTY into NR_WRITEBACK.
> + *
> + * Keyed on wb->memcg_css to match the clamp itself.
> + * The cgwb holds a css reference, so the memcg
> + * pointer is stable without additional locking.
> + *
> + * Caveat: memcg_page_state() aggregates across ALL
> + * backing devices owned by this memcg, while mdtc
> + * is scoped to one wb. A writer to a fast BDI may
> + * observe backlog accumulated on slow BDIs in the
> + * same memcg and throttle more than strictly needed.
> + * Accepted for v1; the alternative (summing per-wb
> + * dirty across the memcg's cgwbs) walks the cgwb
> + * list under RCU on a hot path.
> */
> - balance_domain_limits(mdtc, strictlimit);
> + if (mdtc->cg_dirty_cap != PAGE_COUNTER_MAX) {
> + struct mem_cgroup *wb_memcg =
> + mem_cgroup_from_css(mdtc->wb->memcg_css);
> +
> + if (wb_memcg)
> + mdtc->dirty =
> + memcg_page_state(wb_memcg,
> + NR_FILE_DIRTY) +
> + memcg_page_state(wb_memcg,
> + NR_WRITEBACK);
> + }
> +
> + domain_dirty_freerun(mdtc, strictlimit);
> }
>
> if (nr_dirty > gdtc->bg_thresh && !writeback_in_progress(wb))
> wb_start_background_writeback(wb);
>
> + /*
> + * dirty_min bypass: when the current memcg's dirty+in-flight
> + * backlog is below its memory.dirty_min reservation, let the
> + * writer proceed without throttling. This check must live
> + * outside the if (mdtc) block because a writer's file may not
> + * yet have been migrated to a cgwb; without cgwb, mdtc is NULL
> + * and the per-memcg block above is skipped entirely.
> + *
> + * cg_dirty_min and cg_dirty_pages come from the per-iteration
> + * snapshot taken above under rcu_read_lock; both are stored
> + * in pages (page_counter_memparse converts bytes -> pages for
> + * dirty_min), so no unit conversion is needed.
> + */
> + if (cg_dirty_min && cg_dirty_pages < cg_dirty_min)
> + goto free_running;
> +
> /*
> * If memcg domain is in effect, @dirty should be under
> * both global and memcg freerun ceilings.
>
> ---
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
> change-id: 20260501-rfc-memcg-dirty-v1-ed4644c3fa8a
>
> Best regards,
> --
> Alireza Haghdoost <haghdoost@xxxxxxxx>
>
>
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR