Re: [PATCH] memcg: introduce kfuncs for fetching memcg stats
From: JP Kobryn
Date: Mon Oct 06 2025 - 16:33:09 EST
On 10/1/25 3:25 PM, Alexei Starovoitov wrote:
On Tue, Sep 30, 2025 at 9:57 PM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
[..]
There is a significant perf benefit to this approach. In terms of
elapsed time, the kfuncs allow a bpf cgroup iterator program to outperform
the traditional file reading method, saving almost 80% of the time spent in
the kernel.
control: elapsed time
real 0m14.421s
user 0m0.183s
sys 0m14.184s
experiment: elapsed time
real 0m3.250s
user 0m0.225s
sys 0m2.916s
Nice, but a github repo somewhere doesn't guarantee that
the work is equivalent.
Please add it as a selftest/bpf instead.
Like was done in commit
https://lore.kernel.org/bpf/20200509175921.2477493-1-yhs@xxxxxx/
to demonstrate equivalence of 'cat /proc' vs iterator approach.
Sure, I'll relocate the test code there.
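For the record, such a selftest BPF program might look roughly like the
sketch below. Everything here is an assumption based on this patch: the
kfunc prototypes, the file name, and the choice of NR_ANON_MAPPED as the
stat to dump. The "iter.s/cgroup" (sleepable) section is used because
memcg_flush_stats can block, per the discussion further down. A user-space
counterpart would then compare the seq_file output against memory.stat.

```c
/* progs/iter_memcg_stats.c -- hypothetical selftest program (sketch) */
#include "bpf_iter.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* kfuncs from this patch, declared as kernel symbols */
extern void memcg_flush_stats(struct cgroup *cgrp) __ksym;
extern unsigned long memcg_node_stat_fetch(struct cgroup *cgrp,
					   int item) __ksym;

SEC("iter.s/cgroup")	/* sleepable: memcg_flush_stats may block */
int dump_memcg_stats(struct bpf_iter__cgroup *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct cgroup *cgrp = ctx->cgroup;

	if (!cgrp)
		return 0;

	/* flush pending rstat updates, then read a single counter */
	memcg_flush_stats(cgrp);
	BPF_SEQ_PRINTF(seq, "anon %lu\n",
		       memcg_node_stat_fetch(cgrp, NR_ANON_MAPPED));
	return 0;
}
```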
[..]
---
mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..aa8cbf883d71 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -870,6 +870,73 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
}
#endif
+static inline struct mem_cgroup *memcg_from_cgroup(struct cgroup *cgrp)
+{
+ return cgrp ? mem_cgroup_from_css(cgrp->subsys[memory_cgrp_id]) : NULL;
+}
+
+__bpf_kfunc static void memcg_flush_stats(struct cgroup *cgrp)
+{
+ struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
+
+ if (!memcg)
+ return;
+
+ mem_cgroup_flush_stats(memcg);
+}
css_rstat_flush() is sleepable, so this kfunc must be sleepable too.
Not sure about the rest.
Good catch. I'll add the sleepable flag where it's needed.
+
+__bpf_kfunc static unsigned long memcg_node_stat_fetch(struct cgroup *cgrp,
+ enum node_stat_item item)
+{
+ struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
+
+ if (!memcg)
+ return 0;
+
+ return memcg_page_state_output(memcg, item);
+}
+
+__bpf_kfunc static unsigned long memcg_stat_fetch(struct cgroup *cgrp,
+ enum memcg_stat_item item)
+{
+ struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
+
+ if (!memcg)
+ return 0;
+
+ return memcg_page_state_output(memcg, item);
+}
+
+__bpf_kfunc static unsigned long memcg_vm_event_fetch(struct cgroup *cgrp,
+ enum vm_event_item item)
+{
+ struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
+
+ if (!memcg)
+ return 0;
+
+ return memcg_events(memcg, item);
+}
+
+BTF_KFUNCS_START(bpf_memcontrol_kfunc_ids)
+BTF_ID_FLAGS(func, memcg_flush_stats)
+BTF_ID_FLAGS(func, memcg_node_stat_fetch)
+BTF_ID_FLAGS(func, memcg_stat_fetch)
+BTF_ID_FLAGS(func, memcg_vm_event_fetch)
+BTF_KFUNCS_END(bpf_memcontrol_kfunc_ids)
At least one of them must be sleepable and the rest probably too?
All of them must be KF_TRUSTED_ARGS too.
Thanks, I'll include the trusted args flag. As to which are sleepable,
only memcg_flush_stats can block.
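Concretely, folding in both points from this thread, the registration
table for v2 could look like the following sketch (flag placement is my
reading of the discussion, not the actual v2):

```c
BTF_KFUNCS_START(bpf_memcontrol_kfunc_ids)
/* flush can block via css_rstat_flush(), so mark it sleepable */
BTF_ID_FLAGS(func, memcg_flush_stats, KF_TRUSTED_ARGS | KF_SLEEPABLE)
BTF_ID_FLAGS(func, memcg_node_stat_fetch, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, memcg_stat_fetch, KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, memcg_vm_event_fetch, KF_TRUSTED_ARGS)
BTF_KFUNCS_END(bpf_memcontrol_kfunc_ids)
```

KF_TRUSTED_ARGS makes the verifier require a trusted (refcounted or
verifier-tracked) cgroup pointer, and KF_SLEEPABLE restricts the flush
kfunc to sleepable program contexts.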
+
+static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &bpf_memcontrol_kfunc_ids,
+};
+
+static int __init bpf_memcontrol_kfunc_init(void)
+{
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+ &bpf_memcontrol_kfunc_set);
+}
Why tracing only?
Hmmm, initially I didn't think about use cases outside of the cgroup
iterator programs. After discussing with teammates though, some other
potential use cases could be within sched_ext or (future) bpf-oom. I'm
thinking I'll go with the "UNSPEC" type in v2.
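For reference, the v2 change being described would presumably be just the
registration call (sketch only):

```c
static int __init bpf_memcontrol_kfunc_init(void)
{
	/*
	 * BPF_PROG_TYPE_UNSPEC registers the set for all program
	 * types, so sched_ext or a future bpf-oom user can call
	 * these kfuncs, not just tracing/iterator programs.
	 */
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
					 &bpf_memcontrol_kfunc_set);
}
```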