Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting

From: Qi Zheng

Date: Tue Jun 23 2026 - 05:20:16 EST


Hi Harry,

On 6/23/26 4:18 PM, Harry Yoo wrote:


On 6/23/26 4:16 PM, Qi Zheng wrote:
Hi Harry,

Hi Qi!

On 6/23/26 2:17 PM, Harry Yoo wrote:
On 6/23/26 11:42 AM, Qi Zheng wrote:
From: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>

The mglru page table walker batches per-generation size deltas in
walk->nr_pages while walking page tables without holding the lruvec
lock.
The reset_batch_size() later folds those deltas into walk->lruvec under
the lruvec lock.

Ouch.

IIRC the user-visible impact of underestimated nr_pages in MGLRU
was premature OOMs because MGLRU does not try to reclaim memory when
nr_pages reaches zero, but there are still more pages.

Perhaps worth mentioning in the changelog?

Maybe this should be placed before "To fix it...".

Thanks!

The page table walker can run concurrently with the memcg reparenting
path
as follows:

CPU0 CPU1
==== ====

walk_mm
--> walk_page_range
--> update_batch_size
--> walk->nr_pages += delta

mem_cgroup_css_offline
--> memcg_reparent_objcgs
--> lock lruvec
lru_gen_reparent_memcg
--> reparent child folios to
parent
unlock lruvec

lock lruvec
reset_batch_size
--> child lrugen->nr_pages += delta

The problem here is that, while grabbing a reference to memcg
(via mem_cgroup_iter(), for example) makes sure that the memcg is not
freed, it does not prevent offlining happening, and reset_batch_size()
doesn't check whether the lruvec has been reparented, or the lruvec
is going to be reparented.

This will trigger the following warning in lru_gen_exit_memcg():

VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
sizeof(lruvec->lrugen.nr_pages)));

To fix it, add lrugen->reparented to remember the new owner of a
reparented lruvec, and make reset_batch_size() charge pending deltas to
that owner.

Could you please explain why it is unavoidable to introduce the new
field and why checking whether the cgroup is dying (and charging deltas
to non-dying parent) doesn't work?

Peiyang tried doing this [1], but it doesn't work because
ss->css_offline() is called before clearing the CSS_ONLINE flag.

Right.

I also considered using mem_cgroup_tryget_online(), but that only prevent
the memcg from being freed. It's doesn't prevent the offlining.

Right.

I think checking CSS_DYING under RCU and grabbing the lruvec
of the first non-dying memcg should work (this pattern is already
used where we use RCU to guarantee memcgs are not freed).

If we do not observe CSS_DYING flag, it is safe to charge deltas
to the lruvec because RCU guarantees that reparenting cannot happen
under us.

If we do observe CSS_DYING, we can walk up the hierarchy and charge
deltas to the first non-dying memcg.

Checking CSS_DYING looks feasible, but the rcu lock alone cannot prevent
reparenting. We should recheck CSS_DYING after acquiring the lruvec
lock, otherwise we might run into the following race:

CPU0 reset_batch_size CPU1 memcg teardown
===================== ==================

read !CSS_DYING

set CSS_DYING
memcg_reparent_objcgs()
lock child lruvec
move child to parent
zero child nr_pages
unlock child lruvec

lock child lruvec
charge stale delta to child

So it seems lock_batch_lruvec() should be implemented like this:

static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
{
struct mem_cgroup *memcg = lruvec_memcg(lruvec);

rcu_read_lock();
retry:
while (memcg && css_is_dying(&memcg->css))
memcg = parent_mem_cgroup(memcg);

lruvec = mem_cgroup_lruvec(memcg, pgdat);
spin_lock_irq(&lruvec->lru_lock);
if (memcg && unlikely(css_is_dying(&memcg->css))) {
spin_unlock_irq(&lruvec->lru_lock);
goto retry;
}

rcu_read_unlock();

return lruvec;
}

This way, there is no need to add lrugen->reparented, right?

Thanks,
Qi