Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer

From: Harry Yoo

Date: Tue May 19 2026 - 02:52:58 EST

On 5/19/26 8:41 AM, Shakeel Butt wrote:

On Mon, May 18, 2026 at 03:28:27PM -0700, Shakeel Butt wrote:

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA
node, but the per-CPU obj_stock_pcp still keys cached_objcg by
pointer. Cross-NUMA workloads now see a drain on every refill and a
miss on every consume that targets a sibling per-node objcg of the
same memcg, producing the 67.7% stress-ng switch-mq regression
reported by LKP.

stock->nr_bytes are fungible across per-node objcgs of one memcg.
Treat the cache as keyed by memcg in __consume_obj_stock() and
__refill_obj_stock() so siblings share the reserve. Compare via
READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so
the rcu_read_lock contract on obj_cgroup_memcg() does not apply.

Sharing the reserve without re-caching means bytes funded by one
per-node objcg's slow path can be consumed/freed under a different
sibling, leaving sub-page residue on whichever sibling was cached at
drain time. The pre-existing obj_cgroup_release() path would WARN and
silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1)
bytes per memcg lifecycle from the page_counter. Forward the residue
into a per-node objcg of the same (post-reparent) memcg at release time
instead, so it can be reconciled later via a refill atomic_xchg or
another release; the chain terminates at root_mem_cgroup, whose
page_counter has no enforced limit.

Please note that this is temporary fix and will be reverted when
per-node kmem accounting is introduced.

... because once per-node kmem accounting is introduced,
"stock->nr_bytes are fungible across per-node objcgs of one memcg"
no longer holds?

And the follow-up plain is to revert this and address it with a multi-objcg percpu stock [1], similar to a multi-memcg percpu charge cache we have now, right? (regardless of per-node kmem accounting's progress)

If this temporary fix imposes other potential correctness issues, would it make sense to land [1] in mainline before the next LTS release and skip this temporary fix?

[1] https://lore.kernel.org/oe-lkp/agtPMpQK2jXdQAY4@xxxxxxxxx

--
Cheers,
Harry / Hyeonggon