Re: [PATCH 0/8] per-memcg-per-node kmem accounting
From: Alexandre Ghiti
Date: Thu May 21 2026 - 09:14:28 EST
On 5/21/26 05:46, Joshua Hahn wrote:
On Wed, 20 May 2026 10:39:59 +0200 Alexandre Ghiti <alex@xxxxxxxx> wrote:
Hi Joshua,
Hi Alex,
On 5/18/26 16:57, Joshua Hahn wrote:
On Mon, 11 May 2026 22:20:35 +0200 Alexandre Ghiti <alex@xxxxxxxx> wrote:
This series pursues the work initiated by Joshua [1]. We need kernel
memory to be accounted on a per-node basis in order to be able to
know the memcg and physical memory association.
This series takes advantage of the recent introduction of per-node
obj_cgroup [2] and makes those obj_cgroup tied to their NUMA node.
The bulk of the series is percpu per-node accounting: percpu
"precharges" the memcg before we know the actual location of the pages
it uses, so charging and accounting had to be split. All other kmem
users (slab, zswap, __memcg_kmem_charge_page) are straightforward
conversions (zswap support is limited in this series because Joshua
is working on it in parallel [3]).
Thanks Joshua for your early feedback!

Hello Alex,
Thank you for your work!
Overall I think the direction makes sense to me. Pre-overcharging makes sense to
me as an approach: we would much rather overaccount than underaccount and
later have to breach limits.
I do have some concerns on performance, though. Namely, I think there are
some expensive operations that would benefit from some performance
benchmarking with this series applied (maybe some simple microbenchmarks that
demonstrate kernel allocation overhead could be useful).
From what I can tell, there is some additional performance overhead that has
to do with iterating over num_possible_cpus() x pages_per_alloc, which
doesn't seem trivial to me.

Indeed, let me microbenchmark the overhead on a large system.

That sounds great with me :-) Looking forward to the numbers!
Another concern that I see is the stock credit system. Maybe we could be
bypassing the stock check leading to more time spent doing the atomic
operations.

I'm not following on this one, which atomic operations do you see that
could be bypassed?

So in my initial scan of patch 7 I had a concern that if we have a nested
stock system (obj_cgroup stock and local credit "stock"), then we could
incur more work if these are out of sync; do extra work in the stock refill
path in obj_cgroup_precharge, and then do extra work on top in the loop
within the pcpu_memcg_post_alloc_hook (obj_cgroup_account_kmem does the
charging atomically I think).
So what I mean is, I'm not sure what the "size" is typically for
pcpu_memcg_post_alloc_hook. But it might be a worthwhile optimization to
precharge all the pages, then for each cpu iterate over the pages to
figure out how many pages are used per nid (doing just math, not actually
doing the atomic adds), and then outside both of these loops just iterate
over every nid_objcg once to perform the atomic operation.
Maybe this is needed or not (depending on how big "size" typically is
and whether we go from doing O(1000) atomic adds --> O(10) or some
big reduction), but I just wanted to toss it out there as something that
could potentially be expensive.
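
For illustration only, the batching idea above can be sketched in userspace C. This is not kernel code and not what the series does: MAX_NIDS, nid_charged, account_batched() and the page_nid array are all hypothetical stand-ins, and C11 atomics stand in for whatever the real accounting path uses. The point is only that the per-page work becomes plain arithmetic, with one atomic add per node at the end.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_NIDS 4 /* hypothetical node count, for illustration only */

/* Hypothetical stand-in for the per-node accounting counters. */
static atomic_long nid_charged[MAX_NIDS];

/*
 * page_nid[cpu * pages_per_cpu + i] is the node backing page i of that
 * CPU's area. Counts pages per node with plain arithmetic, then issues
 * one atomic add per node that was actually used.
 * Returns the number of atomic adds performed.
 */
static size_t account_batched(const int *page_nid, size_t nr_cpus,
			      size_t pages_per_cpu)
{
	long per_nid[MAX_NIDS] = { 0 };
	size_t atomics = 0;

	/* Pass 1: no atomics, just tally pages per node. */
	for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
		for (size_t i = 0; i < pages_per_cpu; i++) {
			int nid = page_nid[cpu * pages_per_cpu + i];

			assert(nid >= 0 && nid < MAX_NIDS);
			per_nid[nid]++;
		}
	}

	/* Pass 2: one atomic add per node, outside both loops. */
	for (int nid = 0; nid < MAX_NIDS; nid++) {
		if (!per_nid[nid])
			continue;
		atomic_fetch_add(&nid_charged[nid], per_nid[nid]);
		atomics++;
	}

	return atomics;
}
```

With this shape the number of atomic adds is bounded by the number of nodes rather than by nr_cpus x pages_per_cpu; whether that matters in practice depends on how large "size" typically is, which is what the microbenchmarks should show.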
I get it, I'll trace the microbenchmarks to see what happens there, thanks for the suggestion.
Thanks again,
Alex
obj_stock caches a single obj_cgroup, which means that if we split the objcg
to be per-node (in patch 6), then the obj_stock basically gets invalidated
every operation since we iterate over more objcgs (even though we are in
the same logical objcg). Maybe I'm missing something?

The objcg split comes from commit 01b9da291c49 ("mm: memcontrol: convert
objcg to be per-memcg per-node type") and the problem you describe is
exactly what Shakeel is trying to fix [1].

Whoops O_o I completely missed that one. Sorry for flagging it again!
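
As a rough userspace model of the single-slot concern discussed here (not the kernel's obj_stock implementation; struct one_slot_stock, stock_consume() and count_refills() are hypothetical names), a cache that holds only one id has to be replaced on every operation as soon as consecutive operations alternate between ids:

```c
#include <stddef.h>

/* Hypothetical single-slot cache: remembers only the last id it served. */
struct one_slot_stock {
	int cached_id;   /* -1 when the slot is empty */
	size_t refills;  /* how many times the slot had to be replaced */
};

/* Serve one operation for 'id'; a mismatch evicts the cached entry. */
static void stock_consume(struct one_slot_stock *s, int id)
{
	if (s->cached_id == id)
		return; /* hit: nothing to refill */
	s->cached_id = id;
	s->refills++;
}

/* Replay a sequence of ids and report how many refills it caused. */
static size_t count_refills(const int *ids, size_t n)
{
	struct one_slot_stock s = { .cached_id = -1, .refills = 0 };

	for (size_t i = 0; i < n; i++)
		stock_consume(&s, ids[i]);

	return s.refills;
}
```

In this model, a run of operations on one id costs a single refill, while the same number of operations alternating between two ids costs one refill each, which is the pattern a per-node split could produce for what is logically one objcg.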
But I remember trying a microbenchmark and noticed a +5% regression (on
top of the 67% then...), I'll rebase this series on top of Shakeel's and
re-run.

Sounds like a great idea! Thanks again Alex, have a great day! :-)
Joshua