Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting

From: Harry Yoo

Date: Tue Jun 23 2026 - 04:19:12 EST




On 6/23/26 4:16 PM, Qi Zheng wrote:
> Hi Harry,

Hi Qi!

> On 6/23/26 2:17 PM, Harry Yoo wrote:
>> On 6/23/26 11:42 AM, Qi Zheng wrote:
>>> From: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
>>>
>>> The mglru page table walker batches per-generation size deltas in
>>> walk->nr_pages while walking page tables without holding the lruvec
>>> lock.
>>> The reset_batch_size() later folds those deltas into walk->lruvec under
>>> the lruvec lock.
>>
>> Ouch.
>>
>> IIRC the user-visible impact of underestimated nr_pages in MGLRU
>> was premature OOMs because MGLRU does not try to reclaim memory when
>> nr_pages reaches zero, but there are still more pages.
>>
>> Perhaps worth mentioning in the changelog?
>
> Maybe this should be placed before "To fix it...".

Thanks!

>>> The page table walker can run concurrently with the memcg reparenting
>>> path
>>> as follows:
>>>
>>> CPU0 CPU1
>>> ==== ====
>>>
>>> walk_mm
>>> --> walk_page_range
>>> --> update_batch_size
>>> --> walk->nr_pages += delta
>>>
>>> mem_cgroup_css_offline
>>> --> memcg_reparent_objcgs
>>> --> lock lruvec
>>> lru_gen_reparent_memcg
>>> --> reparent child folios to
>>> parent
>>> unlock lruvec
>>>
>>> lock lruvec
>>> reset_batch_size
>>> --> child lrugen->nr_pages += delta
>>
>> The problem here is that, while grabbing a reference to memcg
>> (via mem_cgroup_iter(), for example) makes sure that the memcg is not
>> freed, it does not prevent offlining happening, and reset_batch_size()
>> doesn't check whether the lruvec has been reparented, or the lruvec
>> is going to be reparented.
>>
>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>
>>> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>> sizeof(lruvec->lrugen.nr_pages)));
>>>
>>> To fix it, add lrugen->reparented to remember the new owner of a
>>> reparented lruvec, and make reset_batch_size() charge pending deltas to
>>> that owner.
>>
>> Could you please explain why it is unavoidable to introduce the new
>> field and why checking whether the cgroup is dying (and charging deltas
>> to non-dying parent) doesn't work?
>
> Peiyang tried doing this [1], but it doesn't work because
> ss->css_offline() is called before clearing the CSS_ONLINE flag.

Right.

> I also considered using mem_cgroup_tryget_online(), but that only prevent
> the memcg from being freed. It's doesn't prevent the offlining.

Right.

I think checking CSS_DYING under RCU and grabbing the lruvec
of the first non-dying memcg should work (this pattern is already
used where we use RCU to guarantee memcgs are not freed).

If we do not observe CSS_DYING flag, it is safe to charge deltas
to the lruvec because RCU guarantees that reparenting cannot happen
under us.

If we do observe CSS_DYING, we can walk up the hierarchy and charge
deltas to the first non-dying memcg.

This requires introducing an API to grab the first non-dying
lruvec lock like folio_lruvec_lock_irq(), but it can be called only
when the caller has a reference to make sure it doens't go away.
(I wonder if there are other places that requires similar semantics)

or am I missing something?

> So in the end, I chose the approach used in this patch. Simply adding
> a new field to mglru to track its reparenting status seems to be the
> most straightforward and effective approach.

I think it would work, but we need some explanation on why this
requires special handling compared to other stuff (e.g., reparenting
of split queues).

> Thanks,
> Qi

Thanks for working on this, Qi!

> [1]. https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-
> a53a-1e28cc894f0b@xxxxxxxxxxxxxxxx
>
>>> Reported-by: Peiyang He <peiyang_he@xxxxxxxxxxxxxxxx>
>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>> efb8-4b9a-a53a-1e28cc894f0b@xxxxxxxxxxxxxxxx
>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>>> Cc: <stable@xxxxxxxxxxxxxxx>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
>>> Reviewed-by: Barry Song <baohua@xxxxxxxxxx>

--
Cheers,
Harry / Hyeonggon

Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature