Re: [PATCH 5/5] mm: memcg: separate slab stat accounting from objcg charge cache
From: Johannes Weiner
Date: Tue Mar 03 2026 - 10:43:45 EST
On Tue, Mar 03, 2026 at 05:45:18AM -0800, Shakeel Butt wrote:
> On Tue, Mar 03, 2026 at 11:42:31AM +0100, Vlastimil Babka (SUSE) wrote:
> > On 3/3/26 09:54, Hao Li wrote:
> > > On Mon, Mar 02, 2026 at 02:50:18PM -0500, Johannes Weiner wrote:
> > >> Cgroup slab metrics are cached per-cpu the same way as the sub-page
> > >> charge cache. However, the intertwined code to manage those dependent
> > >> caches right now is quite difficult to follow.
> > >>
> > >> Specifically, cached slab stat updates occur in consume() if there was
> > >> enough charge cache to satisfy the new object. If that fails, whole
> > >> pages are reserved, and slab stats are updated when the remainder of
> > >> those pages, after subtracting the size of the new slab object, are
> > >> put into the charge cache. This already juggles a delicate mix of the
> > >> object size, the page charge size, and the remainder to put into the
> > >> byte cache. Doing slab accounting in this path as well is fragile, and
> > >> has recently caused a bug where the input parameters between the two
> > >> caches were mixed up.
> > >>
> > >> Refactor the consume() and refill() paths into unlocked and locked
> > >> variants that only do charge caching. Then let the slab path manage
> > >> its own lock section and open-code charging and accounting.
> > >>
> > >> This makes the slab stat cache subordinate to the charge cache:
> > >> __refill_obj_stock() is called first to prepare it;
> > >> __account_obj_stock() follows to hitch a ride.
> > >>
> > >> This results in a minor behavioral change: previously, a mismatching
> > >> percpu stock would always be drained for the purpose of setting up
> > >> slab account caching, even if there was no byte remainder to put into
> > >> the charge cache. Now, the stock is left alone, and slab accounting
> > >> takes the uncached path if there is a mismatch. This is exceedingly
> > >> rare, and it was probably never worth draining the whole stock just to
> > >> cache the slab stat update.
> > >>
> > >> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > >> ---
> > >> mm/memcontrol.c | 100 +++++++++++++++++++++++++++++-------------------
> > >> 1 file changed, 61 insertions(+), 39 deletions(-)
> > >>
> > >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > >> index 4f12b75743d4..9c6f9849b717 100644
> > >> --- a/mm/memcontrol.c
> > >> +++ b/mm/memcontrol.c
> > >> @@ -3218,16 +3218,18 @@ static struct obj_stock_pcp *trylock_stock(void)
> > >>
> > >
> > > [...]
> > >
> > >> @@ -3376,17 +3383,14 @@ static bool obj_stock_flush_required(struct obj_stock_pcp *stock,
> > >> return flush;
> > >> }
> > >>
> > >> -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > >> - bool allow_uncharge, int nr_acct, struct pglist_data *pgdat,
> > >> - enum node_stat_item idx)
> > >> +static void __refill_obj_stock(struct obj_cgroup *objcg,
> > >> + struct obj_stock_pcp *stock,
> > >> + unsigned int nr_bytes,
> > >> + bool allow_uncharge)
> > >> {
> > >> - struct obj_stock_pcp *stock;
> > >> unsigned int nr_pages = 0;
> > >>
> > >> - stock = trylock_stock();
> > >> if (!stock) {
> > >> - if (pgdat)
> > >> - __account_obj_stock(objcg, NULL, nr_acct, pgdat, idx);
> > >> nr_pages = nr_bytes >> PAGE_SHIFT;
> > >> nr_bytes = nr_bytes & (PAGE_SIZE - 1);
> > >> atomic_add(nr_bytes, &objcg->nr_charged_bytes);
> > >> @@ -3404,20 +3408,25 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > >> }
> > >> stock->nr_bytes += nr_bytes;
> > >>
> > >> - if (pgdat)
> > >> - __account_obj_stock(objcg, stock, nr_acct, pgdat, idx);
> > >> -
> > >> if (allow_uncharge && (stock->nr_bytes > PAGE_SIZE)) {
> > >> nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> > >> stock->nr_bytes &= (PAGE_SIZE - 1);
> > >> }
> > >>
> > >> - unlock_stock(stock);
> > >> out:
> > >> if (nr_pages)
> > >> obj_cgroup_uncharge_pages(objcg, nr_pages);
> > >> }
> > >>
> > >> +static void refill_obj_stock(struct obj_cgroup *objcg,
> > >> + unsigned int nr_bytes,
> > >> + bool allow_uncharge)
> > >> +{
> > >> + struct obj_stock_pcp *stock = trylock_stock();
> > >> + __refill_obj_stock(objcg, stock, nr_bytes, allow_uncharge);
> > >> + unlock_stock(stock);
> > >
> > > Hi Johannes,
> > >
> > > I noticed that after this patch, obj_cgroup_uncharge_pages() is now inside
> > > the obj_stock.lock critical section. Since obj_cgroup_uncharge_pages() calls
> > > refill_stock(), which seems non-trivial, this might increase the lock hold time.
> > > In particular, could that lead to more failed trylocks for IRQ handlers on
> > > non-RT kernels (or for tasks that preempt others on RT kernels)?
Good catch. I did ponder this, but had forgotten about it by the time I
wrote the changelog.
> > Yes, it also seems a bit self-defeating? (at least in theory)
> >
> > refill_obj_stock()
> > trylock_stock()
> > __refill_obj_stock()
> > obj_cgroup_uncharge_pages()
> > refill_stock()
> > local_trylock() -> nested, will fail
>
> Not really, as the local_locks are different, i.e. memcg_stock.lock in
> refill_stock() and obj_stock.lock in refill_obj_stock().
Right, refilling the *byte* stock could produce enough excess that we
refill the *page* stock, which in turn could produce enough excess
that we drain back to the page counters (shared atomics).
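Spelled out like Vlastimil's trace above (sketch; the bottom levels per
the current code):

refill_obj_stock()
  __refill_obj_stock()  -> obj_stock.lock held
    obj_cgroup_uncharge_pages()
      refill_stock()  -> memcg_stock.lock
        drain_stock()  -> on excess
          memcg_uncharge()  -> page counter atomics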
> However, Hao's concern is valid, and I think it can be easily fixed by
> moving obj_cgroup_uncharge_pages() out of obj_stock.lock.
Note that we now have multiple callsites of __refill_obj_stock(). Do
we care enough to move this out to the callers?
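Something like this, I guess (untested sketch; it assumes
__refill_obj_stock() is changed to return the page excess instead of
uncharging it itself, and the other callsites would need the same
treatment):

static void refill_obj_stock(struct obj_cgroup *objcg,
			     unsigned int nr_bytes,
			     bool allow_uncharge)
{
	struct obj_stock_pcp *stock = trylock_stock();
	unsigned int nr_pages;

	/* only per-cpu state is touched under obj_stock.lock */
	nr_pages = __refill_obj_stock(objcg, stock, nr_bytes, allow_uncharge);
	unlock_stock(stock);

	/* settle the excess against the page counters unlocked */
	if (nr_pages)
		obj_cgroup_uncharge_pages(objcg, nr_pages);
}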
There are a few other places with a similar pattern:
- drain_obj_stock(): calls memcg_uncharge() under the lock
- drain_stock(): calls memcg_uncharge() under the lock
- refill_stock(): still does full drain_stock()
All of these could be more intentional about only updating the per-cpu
data under the lock and the page counters outside of it.
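E.g. drain_obj_stock() could hand the excess back to the caller and
uncharge after dropping the lock. Sketch, with a hypothetical
__drain_obj_stock() helper; the complication is that the objcg
reference has to be threaded out of the locked section as well:

	struct obj_cgroup *objcg = NULL;
	unsigned int nr_pages = 0;

	stock = trylock_stock();
	if (stock)
		/* per-cpu state only; pins and hands out the objcg */
		nr_pages = __drain_obj_stock(stock, &objcg);
	unlock_stock(stock);

	if (nr_pages)
		obj_cgroup_uncharge_pages(objcg, nr_pages); /* atomics */
	if (objcg)
		obj_cgroup_put(objcg);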
Given that IRQ allocations/frees are rare, nested ones even rarer, and
the "slowpath" is a few extra atomics, I'm not sure it's worth the code
complication, at least until proven otherwise.
What do you think?