[RFC PATCH v2 2/9] memcg: charge kmem pages and slab objects against per-node objcg
From: Alexandre Ghiti
Date: Fri Jun 26 2026 - 06:15:26 EST
From: Shakeel Butt <shakeel.butt@xxxxxxxxx>
After 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type"), current_obj_cgroup() returns the per-node objcg of the
task's memcg for numa_node_id(). Two callers in the kmem accounting
path always know the actual target node: __memcg_kmem_charge_page()
knows the page being charged, and __memcg_slab_post_alloc_hook() knows
each slab being charged. Both still used current_obj_cgroup() and so
charged against an objcg whose nid did not match the allocation's
physical node. As a result, the per-objcg vmstat batching (keyed by
objcg->nid) and the per-node charge attribution were both routed to the
wrong sibling objcg of the same memcg whenever the allocating CPU's node
differed from the allocation's node.

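The mismatch described above can be illustrated with a small userspace
model. This is only a sketch, not kernel code: every model_* name and
MODEL_NR_NODES are hypothetical stand-ins, and the "memcg" here is just
an array with one counter per node. It contrasts a lookup keyed by the
allocating CPU's node with a lookup keyed by the page's own node.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_NODES 2

/* Hypothetical stand-in for struct obj_cgroup; only what the model needs. */
struct model_objcg {
	int nid;            /* node this objcg accounts for */
	long charged_pages; /* pages charged through this objcg */
};

/* Hypothetical stand-in for a memcg that owns one objcg per node. */
struct model_memcg {
	struct model_objcg objcg[MODEL_NR_NODES];
};

static void model_memcg_init(struct model_memcg *memcg)
{
	for (int nid = 0; nid < MODEL_NR_NODES; nid++) {
		memcg->objcg[nid].nid = nid;
		memcg->objcg[nid].charged_pages = 0;
	}
}

/* Models the per-node lookup: select the objcg for an explicit node. */
static struct model_objcg *model_objcg_for_node(struct model_memcg *memcg,
						int nid)
{
	assert(nid >= 0 && nid < MODEL_NR_NODES);
	return &memcg->objcg[nid];
}

/* Pre-patch behaviour: the lookup is keyed by the allocating CPU's node. */
static struct model_objcg *model_charge_by_cpu_node(struct model_memcg *memcg,
						    int cpu_nid, long nr_pages)
{
	struct model_objcg *objcg = model_objcg_for_node(memcg, cpu_nid);

	objcg->charged_pages += nr_pages;
	return objcg;
}

/* Post-patch behaviour: the lookup is keyed by the page's own node. */
static struct model_objcg *model_charge_by_page_node(struct model_memcg *memcg,
						     int page_nid, long nr_pages)
{
	struct model_objcg *objcg = model_objcg_for_node(memcg, page_nid);

	objcg->charged_pages += nr_pages;
	return objcg;
}
```

In this model, a task running on node 0 that charges a page living on
node 1 ends up with the pages counted under objcg[0] when the CPU-keyed
variant is used, and under objcg[1] when the page-keyed variant is used.
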
Factor the per-node objcg lookup into __current_obj_cgroup(int nid)
and keep current_obj_cgroup() as a one-line wrapper that passes
numa_node_id(), preserving all other callers. Use the new helper in:

  - __memcg_kmem_charge_page(): pass page_to_nid(page).
  - __memcg_slab_post_alloc_hook(): re-fetch inside the loop using
    slab_nid(slab) so each slab in a bulk allocation is charged
    against its own node's objcg. The early per-task root/NULL check
    above the loop remains (all per-node objcgs of a memcg share the
    same root-ness, so it is still a valid fast path); the in-loop
    check guards the transient drain window where one node's entry
    may be NULL.

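The per-slab re-fetch and the in-loop check can be sketched in
userspace as follows. Again this is an illustrative model under stated
assumptions, not the kernel implementation: the sim_* names and
SIM_NR_NODES are hypothetical, a NULL per-node pointer stands in for the
drain window, and "charging" is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_NR_NODES 2

/* Hypothetical stand-in for struct obj_cgroup. */
struct sim_objcg {
	int nid;           /* node this objcg accounts for */
	int is_root;       /* non-zero models the root objcg */
	long charged_objs; /* objects charged through this objcg */
};

/* Per-node objcg pointers; an entry may be NULL to model the drain window. */
struct sim_memcg {
	struct sim_objcg *objcg[SIM_NR_NODES];
};

/* Models the per-node lookup for an explicit node id. */
static struct sim_objcg *sim_objcg_for_node(const struct sim_memcg *memcg,
					    int nid)
{
	assert(nid >= 0 && nid < SIM_NR_NODES);
	return memcg->objcg[nid];
}

/*
 * Charge one object per entry. slab_nids[i] is the node of the slab that
 * backs object i. Returns the number of objects actually charged.
 */
static size_t sim_bulk_charge(const struct sim_memcg *memcg,
			      const int *slab_nids, size_t count)
{
	size_t charged = 0;

	for (size_t i = 0; i < count; i++) {
		/* Re-fetch per object so each slab uses its own node's objcg. */
		struct sim_objcg *objcg =
			sim_objcg_for_node(memcg, slab_nids[i]);

		/* Skip objects whose node entry is missing or is root. */
		if (!objcg || objcg->is_root)
			continue;

		objcg->charged_objs++;
		charged++;
	}
	return charged;
}
```

With slab nodes {0, 1, 1, 0, 1}, the model charges two objects to node 0
and three to node 1. If the node 1 entry is then set to NULL, a second
pass charges only the two node 0 objects and leaves the others
uncharged, which is the case the in-loop check is there to handle.
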
Update the stale slab_pgdat(slab) reference in the TODO comment to
slab_nid(slab); slab_pgdat is no longer relevant after the
obj_stock_pcp cached_pgdat removal.
Signed-off-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
Signed-off-by: Alexandre Ghiti <alex@xxxxxxxx>
---
mm/memcontrol.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ee47427de9e2..3bcc20e72914 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2987,12 +2987,11 @@ static struct obj_cgroup **current_objcg_update(void)
return objcgs;
}
-__always_inline struct obj_cgroup *current_obj_cgroup(void)
+__always_inline static struct obj_cgroup *__current_obj_cgroup(int nid)
{
struct mem_cgroup *memcg;
struct obj_cgroup *objcg;
struct obj_cgroup **objcgs;
- int nid = numa_node_id();
if (IS_ENABLED(CONFIG_MEMCG_NMI_UNSAFE) && in_nmi())
return NULL;
@@ -3036,6 +3035,11 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
return rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1);
}
+__always_inline struct obj_cgroup *current_obj_cgroup(void)
+{
+ return __current_obj_cgroup(numa_node_id());
+}
+
struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
{
struct obj_cgroup *objcg;
@@ -3143,7 +3147,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
struct obj_cgroup *objcg;
int ret = 0;
- objcg = current_obj_cgroup();
+ objcg = __current_obj_cgroup(page_to_nid(page));
if (objcg && !obj_cgroup_is_root(objcg)) {
ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
if (!ret) {
@@ -3536,7 +3540,9 @@ static inline size_t obj_full_size(struct kmem_cache *s)
{
/*
* For each accounted object there is an extra space which is used
- * to store obj_cgroup membership. Charge it too.
+ * to store obj_cgroup membership. Charge it too. In addition, we
+ * allocate obj_exts array on the same node as slab_nid(), so per-node
+ * kmem accounting is fine.
*/
return s->size + sizeof(struct obj_cgroup *);
}
@@ -3594,6 +3600,16 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
continue;
}
+ /*
+ * Charge against the per-node objcg matching the slab's node
+ * so the stock's per-objcg vmstat batch (keyed by objcg->nid)
+ * aligns with the physical slab. May transiently fall back to
+ * root if the per-node entry is being drained.
+ */
+ objcg = __current_obj_cgroup(slab_nid(slab));
+ if (!objcg || obj_cgroup_is_root(objcg))
+ continue;
+
/*
* if we fail and size is 1, memcg_alloc_abort_single() will
* just free the object, which is ok as we have not assigned
@@ -3602,7 +3618,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
* for larger sizes, kmem_cache_free_bulk() will uncharge
* any objects that were already charged and obj_ext assigned
*
- * TODO: we could batch this until slab_pgdat(slab) changes
+ * TODO: we could batch this until slab_nid(slab) changes
* between iterations, with a more complicated undo
*/
stock = trylock_stock();
--
2.54.0