Re: [PATCH v2 1/2] mm/mglru: fix cgroup OOM during MGLRU state switching
From: Barry Song
Date: Thu Mar 12 2026 - 16:08:46 EST
On Fri, Mar 13, 2026 at 12:44 AM Leno Hou <lenohou@xxxxxxxxx> wrote:
>
> On 3/12/26 2:02 PM, Barry Song wrote:
> [...]
> >> This effectively eliminates the race window that previously triggered OOMs
> >> under high memory pressure.
> >
> > I don't think this eliminates the race window, but it does reduce it.
> > There is nothing preventing the draining state from changing while
> > you are shrinking.
> >
> Hi Barry,
>
> You raise a valid point about the race window.
>
> While the window still exists, I believe its impact is effectively
> mitigated by the reclaim retry mechanism: even if the reclaimer makes
> a suboptimal path decision due to the race, the kernel's reclaim loop
> (in do_try_to_free_pages) provides multiple retries. If the first pass
> fails to reclaim memory because of the race or an inconsistent state,
> a subsequent retry will observe the updated draining state and
> correctly direct the scan to the appropriate LRU lists.
>
> Anyway, thanks for the correction. I'll update the commit message.
I’m fine with this race condition as long as it is clearly documented
in the commit message.
>
> >>
> >> The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> >> a high-pressure memory cgroup (v1) environment.
> >>
> >> To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> >> To: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
> >> To: Yuanchu Xie <yuanchu@xxxxxxxxxx>
> >> To: Wei Xu <weixugc@xxxxxxxxxx>
> >> To: Barry Song <21cnbao@xxxxxxxxx>
> >> To: Jialing Wang <wjl.linux@xxxxxxxxx>
> >> To: Yafang Shao <laoar.shao@xxxxxxxxx>
> >> To: Yu Zhao <yuzhao@xxxxxxxxxx>
> >> To: Kairui Song <ryncsn@xxxxxxxxx>
> >> To: Bingfang Guo <bfguo@xxxxxxxxxx>
> >> Cc: linux-mm@xxxxxxxxx
> >> Cc: linux-kernel@xxxxxxxxxxxxxxx
> >> Signed-off-by: Leno Hou <lenohou@xxxxxxxxx>
> >> ---
> >> include/linux/mm_inline.h | 5 +++++
> >> mm/rmap.c | 2 +-
> >> mm/swap.c | 14 ++++++++------
> >> mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++---------
> >> 4 files changed, 54 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> >> index fa2d6ba811b5..e6443e22bf67 100644
> >> --- a/include/linux/mm_inline.h
> >> +++ b/include/linux/mm_inline.h
> >> @@ -321,6 +321,11 @@ static inline bool lru_gen_in_fault(void)
> >> return false;
> >> }
> >>
> >> +static inline int folio_lru_gen(const struct folio *folio)
> >> +{
> >> + return -1;
> >> +}
> >> +
> >> static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> >> {
> >> return false;
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 0f00570d1b9e..488bcdca65ed 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -958,7 +958,7 @@ static bool folio_referenced_one(struct folio *folio,
> >> return false;
> >> }
> >>
> >> - if (lru_gen_enabled() && pvmw.pte) {
> >> + if ((folio_lru_gen(folio) != -1) && pvmw.pte) {
> >
> > I am not quite sure if a folio's gen is set to -1 when it is isolated
> > from MGLRU for reclamation. If so, I don't think this would work.
> >
>
> Thanks for your feedback. You are absolutely right—relying solely on
> folio_lru_gen(folio) != -1 is insufficient because the generation tag
> is cleared during isolation in lru_gen_del_folio().
>
> In the current implementation, once a page is isolated, its generation
> information is lost, causing the logic to fall back to the traditional LRU
> path prematurely.
>
> To address this, I am changing the logic to:
> ```c
> @@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio,
> nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
> }
>
> - if (lru_gen_enabled() && pvmw.pte) {
> + if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) {
> if (lru_gen_look_around(&pvmw, nr))
> referenced++;
> } else if (pvmw.pte) {
> ```
> The rationale is that lru_gen_look_around() is an MGLRU-specific optimization
> that promotes pages based on generation aging. During the draining
> transition, we should strictly rely on traditional RMAP-based reference
> auditing to ensure that pages are correctly migrated to traditional LRU
> lists without being spuriously promoted by MGLRU logic. This avoids
> mixing MGLRU's generation-based promotion with traditional LRU's 'active'
> status, preventing potential reclaim state inconsistencies during the
> transition period.
Right, using folio_lru_gen(folio) != -1 is dangerous because it
implicitly depends on whether the folio is still on the LRU list.
Even if we never switch between MGLRU and the active/inactive LRU,
the check can become invalid once the folio is removed from the LRU
for various reasons—for example, during isolate_folio().
So if possible, we should identify the real issue that led us to rely
on folio_lru_gen(folio) != -1 in the first place, and avoid inventing
this kind of unstable condition check.
Please DO NOT depend on folio_lru_gen(folio) != -1. If it is
unavoidable, limit its use to the smallest scope possible.
Thanks
Barry