Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file

From: JP Kobryn
Date: Mon Nov 10 2025 - 14:30:09 EST


On 11/9/25 10:20 PM, Leon Huang Fu wrote:
On Fri, Nov 7, 2025 at 1:02 AM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:

On 11/4/25 11:49 PM, Leon Huang Fu wrote:
On high-core count systems, memory cgroup statistics can become stale
due to per-CPU caching and deferred aggregation. Monitoring tools and
management applications sometimes need guaranteed up-to-date statistics
at specific points in time to make accurate decisions.

This patch adds write handlers to both memory.stat and memory.numa_stat
files to allow userspace to explicitly force an immediate flush of
memory statistics. When "1" is written to either file, it triggers
__mem_cgroup_flush_stats(memcg, true), which unconditionally flushes
all pending statistics for the cgroup and its descendants.

The write operation validates the input and only accepts the value "1",
returning -EINVAL for any other input.

Usage example:
# Force immediate flush before reading critical statistics
echo 1 > /sys/fs/cgroup/mygroup/memory.stat
cat /sys/fs/cgroup/mygroup/memory.stat
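
The same flush-then-read sequence can be sketched in C for tools that do
not shell out. This is only a sketch under stated assumptions:
flush_then_read() is a hypothetical helper name, not part of the patch,
and the write is only meaningful on a kernel that carries the proposed
write handler. On other kernels the open or the write is expected to fail,
and the helper returns a negative errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "1" to stat_path to request a flush, then
 * read the file back into buf (NUL-terminated). A single read() is
 * issued, so output larger than len - 1 bytes is truncated.
 * Returns the number of bytes read, or a negative errno on failure.
 */
ssize_t flush_then_read(const char *stat_path, char *buf, size_t len)
{
	ssize_t n;
	int fd, err;

	if (len == 0)
		return -EINVAL;

	/* Step 1: request the flush (assumes the proposed write handler). */
	fd = open(stat_path, O_WRONLY);
	if (fd < 0)
		return -errno;
	n = write(fd, "1", 1);
	err = errno;
	close(fd);
	if (n < 0)
		return -err;

	/* Step 2: read the statistics back. */
	fd = open(stat_path, O_RDONLY);
	if (fd < 0)
		return -errno;
	n = read(fd, buf, len - 1);
	err = errno;
	close(fd);
	if (n < 0)
		return -err;

	buf[n] = '\0';
	return n;
}
```

A caller would pass a path such as /sys/fs/cgroup/mygroup/memory.stat and
treat a negative return as "flush not supported or not permitted", falling
back to a plain read.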

This provides several benefits:

1. On-demand accuracy: Tools can flush only when needed, avoiding
continuous overhead

2. Targeted flushing: Allows flushing specific cgroups when precision
is required for particular workloads

I'm curious about your use case. Since you mention required precision,
are you planning on manually flushing before every read?


Yes, for our use case, manual flushing before critical reads is necessary.
We're going to run on high-core count servers (224-256 cores), where the
per-CPU batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus) can
accumulate up to 16,384 events (on 256 cores) before an automatic flush is
triggered. This means memory statistics are likely to be stale, often beyond
the tolerance we can accept for critical memory management decisions.
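
The 16,384 figure follows from multiplying the per-CPU batch size by the
number of online CPUs. A minimal sketch of that arithmetic, assuming
MEMCG_CHARGE_BATCH is 64 (the value implied by 16,384 / 256; check
include/linux/memcontrol.h in your tree). The names flush_threshold() and
print_flush_thresholds() are illustrative, not kernel functions.

```c
#include <stdio.h>

/* Assumed value of the kernel's MEMCG_CHARGE_BATCH (16384 / 256 = 64). */
#define ASSUMED_MEMCG_CHARGE_BATCH 64UL

/* Pending updates that may accumulate before an automatic flush. */
unsigned long flush_threshold(unsigned int online_cpus)
{
	return ASSUMED_MEMCG_CHARGE_BATCH * online_cpus;
}

/* Print the threshold for the core counts mentioned in this thread. */
void print_flush_thresholds(void)
{
	static const unsigned int cpus[] = { 224, 256 };
	size_t i;

	for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
		printf("%u cpus -> %lu pending updates\n",
		       cpus[i], flush_threshold(cpus[i]));
}
```

Under that assumption, the 224- and 256-core machines tolerate 14,336 and
16,384 pending updates respectively before an automatic flush.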

Our monitoring tools don't need to flush on every read - only when making
critical decisions like OOM adjustments, container placement, or resource
limit enforcement. The opt-in nature of this mechanism allows us to pay the
flush cost only when precision is truly required.


3. Integration flexibility: Monitoring scripts can decide when to pay
the flush cost based on their specific accuracy requirements

[...]
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c34029e92bab..d6a5d872fbcb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }

+int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	if (val != 1)
+		return -EINVAL;
+
+	if (css)
+		css_rstat_flush(css);

This is a kfunc. You can do this right now from a bpf program without
any kernel changes.


While css_rstat_flush() is indeed available as a BPF kfunc, the practical
challenge is determining when to call it. The natural hook point would be
memory_stat_show() using fentry, but this runs into a BPF verifier
limitation: the function's 'struct seq_file *' argument doesn't provide a
trusted path to obtain the 'struct cgroup_subsys_state *css' pointer
required by css_rstat_flush().

Ok, I see this would only work on the css for base stats.

SEC("iter.s/cgroup")
int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;
	struct cgroup_subsys_state *css;

	if (!cgrp)
		return 1;

	/* example of flushing css for base cpu stats
	 * css = container_of(cgrp, struct cgroup_subsys_state, cgroup);
	 * if (!css)
	 *	return 1;
	 * css_rstat_flush(css);
	 */

	/* get css for memcg stats */
	css = cgrp->subsys[memory_cgrp_id];
	if (!css)
		return 1;
	css_rstat_flush(css); <- confirm untrusted pointer arg error
	...


I attempted to implement this via BPF (code below), but it fails
verification because deriving the css pointer through
seq->private->kn->parent->priv results in an untrusted scalar that the
verifier rejects for the kfunc call:

R1 invalid mem access 'scalar'

The verifier error occurs because:
1. seq->private is rdonly_untrusted_mem
2. Dereferencing through kernfs_node internals produces untracked pointers
3. css_rstat_flush() requires a trusted css pointer per its kfunc definition

A direct userspace interface (memory.stat_refresh) avoids these verifier
limitations and provides a cleaner, more maintainable solution that doesn't
require BPF expertise or complex workarounds.

This is subjective. Given your use case and the critical decisions you
mention, you should have a look at the work being done on BPF OOM [0][1].
I think you would benefit from this series. For your case specifically,
it provides the ability to flush a memcg on demand and also fetch its
stats.

[0] https://lore.kernel.org/all/20251027231727.472628-1-roman.gushchin@xxxxxxxxx/
[1] https://lore.kernel.org/all/20251027232206.473085-2-roman.gushchin@xxxxxxxxx/