Re: [PATCH 0/8] per-memcg-per-node kmem accounting
From: Joshua Hahn
Date: Mon May 18 2026 - 11:13:13 EST
On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@xxxxxxxx> wrote:
> This series pursues the work initiated by Joshua [1]. We need kernel
> memory to be accounted on a per-node basis in order to be able to
> know the memcg and physical memory association.
>
> This series takes advantage of the recent introduction of per-node
> obj_cgroup [2] and makes those obj_cgroup tied to their numa node.
>
> The bulk of the series is percpu per-node accounting: percpu
> "precharges" the memcg before we know the actual location of the pages
> it uses, so charging and accounting had to be split. All other kmem
> users (slab, zswap, __memcg_kmem_charge_page) are straightforward
> conversions (zswap support is limited in this series because Joshua
> is working on it in parallel [3]).
>
> Thanks Joshua for your early feedbacks!
Hello Alex,
Thank you for your work!
Overall, the direction makes sense to me. Pre-overcharging also makes sense
as an approach: we would much rather overaccount than underaccount and
later have to breach limits.
I do have some concerns about performance, though. Namely, there are some
expensive operations that I think would benefit from performance
benchmarking with this series applied (maybe some simple microbenchmarks that
demonstrate kernel allocation overhead could be useful).
From what I can tell, there is some additional performance overhead that has
to do with iterating over num_possible_cpus() x pages_per_alloc, which
doesn't seem trivial to me.
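
As a rough illustration of the loop shape in question, here is a userspace
sketch, not code from the series. Everything in it is an assumption made for
the example: the toy_ names, the round-robin cpu-to-node mapping, and the
fixed node limit. It only shows that the work grows with the product of the
CPU count and the pages per allocation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 8 /* hypothetical upper bound, for the example only */

/*
 * Hypothetical model: visit every (cpu, page) pair of one percpu
 * allocation and tally the pages per NUMA node. Returns the number of
 * inner-loop iterations, i.e. nr_cpus * pages_per_alloc.
 */
static unsigned long toy_account_percpu(unsigned int nr_cpus,
					unsigned int pages_per_alloc,
					unsigned int nr_nodes,
					unsigned long per_node[TOY_MAX_NODES])
{
	unsigned long steps = 0;

	assert(nr_nodes > 0 && nr_nodes <= TOY_MAX_NODES);

	for (unsigned int n = 0; n < TOY_MAX_NODES; n++)
		per_node[n] = 0;

	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++) {
		/* Assumed mapping: CPUs are spread round-robin over nodes. */
		unsigned int nid = cpu % nr_nodes;

		for (unsigned int page = 0; page < pages_per_alloc; page++) {
			per_node[nid]++; /* stands in for per-node accounting */
			steps++;
		}
	}
	return steps;
}
```

With 64 possible CPUs, 4 pages per allocation and 2 nodes, the model takes
256 accounting steps for a single allocation, 128 per node.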
Another concern that I see is the stock credit system. We may end up
bypassing the stock check, which would mean more time spent doing the atomic
operations.
obj_stock caches a single obj_cgroup, which means that if we split the objcg
to be per-node (in patch 6), then the obj_stock basically gets invalidated
on every operation, since we iterate over more objcgs (even though we are in
the same logical objcg). Maybe I'm missing something?
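
The single-slot concern can be sketched with a small userspace model. This is
not the kernel's obj_stock implementation: the toy_ names, the integer ids
standing in for per-node objcgs, and the 4096-byte refill size are all
assumptions made for the example. It counts how often a one-entry cache falls
through to its slow path when charges alternate between per-node ids.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_REFILL 4096u /* hypothetical refill size in bytes */

/* Hypothetical one-entry stock: remembers a single id and its credit. */
struct toy_stock {
	int cached_id;      /* id currently cached, -1 if none */
	unsigned int bytes; /* pre-charged bytes still available */
};

/* Returns 1 on a cache hit (fast path), 0 when the slow path is taken. */
static int toy_consume(struct toy_stock *s, int id, unsigned int nbytes)
{
	assert(nbytes <= TOY_REFILL);

	if (s->cached_id == id && s->bytes >= nbytes) {
		s->bytes -= nbytes;
		return 1;
	}
	/* Miss: drop the old entry and refill for the new id. This stands
	 * in for the expensive atomic updates on the real slow path. */
	s->cached_id = id;
	s->bytes = TOY_REFILL - nbytes;
	return 0;
}

/* Count slow-path trips for nr_ops charges spread round-robin over
 * nr_nodes ids. */
static unsigned int toy_count_misses(unsigned int nr_nodes,
				     unsigned int nr_ops,
				     unsigned int nbytes)
{
	struct toy_stock s = { .cached_id = -1, .bytes = 0 };
	unsigned int misses = 0;

	assert(nr_nodes > 0);

	for (unsigned int i = 0; i < nr_ops; i++)
		if (!toy_consume(&s, (int)(i % nr_nodes), nbytes))
			misses++;
	return misses;
}
```

In this model, 1000 charges of 64 bytes against a single id take the slow
path 16 times, while the same charges alternating between two ids take it
all 1000 times. Real workloads would not alternate this strictly, so the
model shows a worst case rather than an expected cost.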
I haven't taken a deep look at the implementation details but just wanted to
raise some high-level items that I noticed. Of course, all of these concerns
are just theoretical; if you can show that the performance delta is not
noticeable, then none of my concerns matter.
I also want to talk more about the local credit system, but let's first see
what the numbers are.
Thanks again, Alex. And I really like patch 2 because it is a solution to
a problem that I ran into in my percpu tracking series that I couldn't think
of before! Thank you for solving my problem too :-)
Have a great day!
Joshua
> [1] https://lore.kernel.org/linux-mm/20260404033844.1892595-1-joshua.hahnjy@xxxxxxxxx/
> [2] https://lore.kernel.org/linux-mm/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@xxxxxxxxxxxxx/
> [3] https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@xxxxxxxxx/
>
> Alexandre Ghiti (8):
> mm: memcontrol: propagate NMI slab stats to memcg vmstats
> mm: percpu: charge obj_exts allocation with __GFP_ACCOUNT
> mm: percpu: Split memcg charging and kmem accounting
> mm: memcontrol: track MEMCG_KMEM per NUMA node
> mm: memcontrol: per-node kmem accounting for page charges
> mm: slab: per-node kmem accounting for slab
> mm: percpu: per-node kmem accounting using local credit
> mm: zswap: per-node kmem accounting for zswap/zsmalloc
>
> include/linux/memcontrol.h | 27 +++++--
> include/linux/mmzone.h | 1 +
> include/linux/zsmalloc.h | 2 +
> mm/memcontrol.c | 150 ++++++++++++++++++++++++++++---------
> mm/percpu-internal.h | 16 +---
> mm/percpu.c | 90 ++++++++++++++++++++--
> mm/vmstat.c | 1 +
> mm/zsmalloc.c | 11 +++
> mm/zswap.c | 9 ++-
> 9 files changed, 242 insertions(+), 65 deletions(-)
>
> --
> 2.54.0
>
>