Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
From: Harry Yoo
Date: Tue May 19 2026 - 11:00:34 EST
On 5/19/26 11:02 PM, Shakeel Butt wrote:
On Tue, May 19, 2026 at 03:46:51PM +0900, Harry Yoo wrote:
On 5/19/26 8:41 AM, Shakeel Butt wrote:
On Mon, May 18, 2026 at 03:28:27PM -0700, Shakeel Butt wrote:
Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA
node, but the per-CPU obj_stock_pcp still keys cached_objcg by
pointer. Cross-NUMA workloads now see a drain on every refill and a
miss on every consume that targets a sibling per-node objcg of the
same memcg, producing the 67.7% stress-ng switch-mq regression
reported by LKP.
stock->nr_bytes are fungible across per-node objcgs of one memcg.
Treat the cache as keyed by memcg in __consume_obj_stock() and
__refill_obj_stock() so siblings share the reserve. Compare via
READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so
the rcu_read_lock contract on obj_cgroup_memcg() does not apply.
Sharing the reserve without re-caching means bytes funded by one
per-node objcg's slow path can be consumed/freed under a different
sibling, leaving sub-page residue on whichever sibling was cached at
drain time. The pre-existing obj_cgroup_release() path would WARN and
silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1)
bytes per memcg lifecycle from the page_counter. Forward the residue
into a per-node objcg of the same (post-reparent) memcg at release time
instead, so it can be reconciled later via a refill atomic_xchg or
another release; the chain terminates at root_mem_cgroup, whose
page_counter has no enforced limit.
Please note that this is temporary fix and will be reverted when
per-node kmem accounting is introduced.
... because once per-node kmem accounting is introduced,
"stock->nr_bytes are fungible across per-node objcgs of one memcg"
no longer holds?
Yes
And the follow-up plain is to revert this and address it with a multi-objcg
percpu stock [1], similar to a multi-memcg percpu charge cache we have now,
right? (regardless of per-node kmem accounting's progress)
Yes
Thanks for confirming!
If this temporary fix imposes other potential correctness issues, would it
make sense to land [1] in mainline before the next LTS release and skip this
temporary fix?
[1] https://lore.kernel.org/oe-lkp/agtPMpQK2jXdQAY4@xxxxxxxxx
The full clean solution might take one more cycle and I think we can not just
ignore 67% regression on 7.1.
That is valid point, unfortunately.
One more thing I have to ask... for v7.1, wouldn't it be a safer option to revert the per-node object change and re-introduce it once we have a cleaner solution?
This change was introduced in v5, but the implementation before v4 had been exposed in -next for a while, and I think we don't have enough justification to keep the per-node objcgs change, at least for v7.1, given that we have an unexpected last-minute regression and
correctness concerns (albeit slight).
--
Cheers,
Harry / Hyeonggon