Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching

From: Barry Song

Date: Mon Mar 02 2026 - 20:34:30 EST

On Tue, Mar 3, 2026 at 1:52 AM Yuanchu Xie <yuanchu@xxxxxxxxxx> wrote:
>
> Hi Yafang,
>
> On Mon, Mar 2, 2026 at 8:36 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > On Mon, Mar 2, 2026 at 5:48 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 2, 2026 at 5:20 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > >
> > > > > The challenge we're currently facing is that we don't yet know which
> > > > > workloads would benefit from it ;)
> > > > > We do want to enable mglru on our production servers, but first we
> > > > > need to address the risk of OOM during the switch—that's exactly why
> > > > > we're proposing this patch.
> > > >
> > > > Nobody objects to your intention to fix it. I’m curious: to what
> > > > extent do we want to fix it? Do we aim to merely reduce the probability
> > > > of OOM and other mistakes, or do we want a complete fix that makes
> > > > the dynamic on/off fully safe?
> > >
> > > Yeah, I'm glad that more people are trying MGLRU and improving it.
> > >
> > > We also have a downstream fix for the OOM on switch issue, but that's
> > > mostly as a fallback in case MGLRU doesn't work well; our goal is
> > > still to enable MGLRU as much as possible,
> >
> > Our goals are aligned.
> > Before enabling mglru, we must first ensure it won't cause OOM errors
> > across multiple servers. We propose fixing this because, during our
> > previous mglru enablement, many instances of a single service OOM'd
> > simultaneously—potentially leading to data loss for that service.
>
> Would it be possible to drain the jobs away from the machine before
> switching LRUs? The MGLRU kill-switch could be improved, but making
> the switch more or less "hitless" would require significant work. Is
> the use case a one-time switch from active/inactive to MGLRU?

I guess the point is that if upstream provides a sysfs knob
(/sys/kernel/mm/lru_gen/enabled) to toggle MGLRU on and off,
then that knob should actually work as intended. Otherwise,
it would be better to remove it.

Based on the previous discussion, we have two options:

1. Reduce the likelihood of OOM and other errors.
   This could be achieved either by applying Leno's patch,
   which suggests shrinking both MGLRU and active/inactive
   lists during switching, or by making shrink_lruvec() wait
   until the switching is complete via schedule_timeout().

   Note that, either way, there is no guarantee the switching
   state won't change during shrink_lruvec().

2. Ensure that shrinking and switching do not occur
   simultaneously by using something like an rwsem:
   shrinking can proceed in parallel under the read
   lock, while the (rare) switching path takes the
   write lock.

If we want to keep the toggle, could we at least make a
small change to reduce the likelihood of mistakes?
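
One possible shape for such a small change, following the "wait until
switching is complete" idea from option 1, is a bounded wait. Again
this is only a userspace sketch: nanosleep() stands in for
schedule_timeout(), and lru_switching and wait_for_lru_switch() are
hypothetical names, not existing kernel symbols:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical flag: set while a switch is in progress. */
static atomic_bool lru_switching;

/*
 * Poll up to max_tries times, sleeping about 1 ms between polls.
 * Returns true if no switch was in progress at the last check.
 * The state may still change right after this returns, so this
 * narrows the window but does not close it.
 */
static bool wait_for_lru_switch(int max_tries)
{
	const struct timespec tick = { .tv_sec = 0, .tv_nsec = 1000000 };

	for (int i = 0; i < max_tries; i++) {
		if (!atomic_load(&lru_switching))
			return true;
		nanosleep(&tick, NULL);
	}
	return !atomic_load(&lru_switching);
}
```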

> I do want to note that OOMs causing data loss is not really the kernel's fault.
>

Best Regards
Barry