Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)

From: teawater

Date: Fri Mar 13 2026 - 02:25:52 EST


>
> >
> > On Mar 12, 2026, at 04:39, Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >
> > On Wed, Mar 11, 2026 at 03:19:31PM +0800, Muchun Song wrote:
> >
> > >
> > >
> > >
> > On Mar 8, 2026, at 02:24, Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >
> >
> > [...]
> >
> >
> > Per-Memcg Background Reclaim
> >
> > In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> > reclaim for limit enforcement, provide per-memcg background reclaimers which can
> > scale across CPUs with the allocation rate.
> >
> > >
> > > Hi Shakeel,
> > >
> > > I'm quite interested in this. Internally, we maintain a private set
> > > of patches implementing asynchronous reclamation, but we are trying
> > > to retire as much of this private code as possible. We therefore want
> > > to implement a similar asynchronous reclamation mechanism in user
> > > space via the memory.reclaim interface. However, there is currently
> > > no suitable policy notification mechanism to prompt user threads to
> > > reclaim proactively in advance.
> > >
> >
> > Cool, can you please share what "suitable policy notification mechanisms" you
> > need for your use case? This will give me more data for comparing
> > memory.reclaim with the proposed approach.
> >
> If we expect proactive reclamation to be triggered when the current
> memcg's memory usage reaches a certain point, we have to continuously read
> memory.current to check whether it has crossed our chosen watermark and
> then trigger asynchronous reclamation. Ideally, an event would notify
> user-space threads when current memory usage reaches a specific
> watermark; the events currently exposed by memory.events lack support
> for custom watermarks.

I agree. Even with BPF controlling proactive reclamation, I believe
there needs to be an event reflecting capacity changes so the reclaimer
knows when to stop. Otherwise, the reclamation volume per batch would
have to be set very low, leading to frequent BPF triggers and poor
efficiency.
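
As a concrete illustration of the polling workaround discussed above, a
minimal user-space sketch might look like the following. The cgroup path,
watermark, and batch size are hypothetical values, not anything from this
thread; it simply shows why a custom-watermark event is wanted:

```python
# Minimal sketch (not from the thread) of the user-space polling
# workaround: without a custom-watermark event, a reclaimer thread must
# repeatedly read memory.current and, once usage crosses a chosen
# watermark, write a byte count to memory.reclaim (cgroup v2).
# CGROUP, WATERMARK, and BATCH are illustrative assumptions.
import time

CGROUP = "/sys/fs/cgroup/mygroup"      # hypothetical cgroup path
WATERMARK = 512 * 1024 * 1024          # trigger point: 512 MiB (arbitrary)
BATCH = 32 * 1024 * 1024               # bytes to reclaim per attempt

def over_watermark(current: int, watermark: int = WATERMARK) -> bool:
    """Pure policy check: has usage crossed the watermark?"""
    return current > watermark

def poll_once() -> bool:
    """One polling iteration; returns True if reclaim was requested."""
    with open(f"{CGROUP}/memory.current") as f:
        current = int(f.read())
    if over_watermark(current):
        # memory.reclaim accepts a byte count to reclaim proactively.
        with open(f"{CGROUP}/memory.reclaim", "w") as f:
            f.write(str(BATCH))
        return True
    return False

def run(poll_interval: float = 0.1) -> None:
    """The busy-polling loop whose overhead is the complaint above."""
    while True:
        poll_once()
        time.sleep(poll_interval)
```

With a kernel-side watermark event, run() could block on a notification
instead of sleeping on a timer, which is exactly the gap being discussed.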

Best,
Hui


>
> >
> > >
> > >
> > >
> >
> > Lock-Aware Throttling
> >
> > The ability to avoid throttling an allocating task that is holding locks, to
> > prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> > in memcg reclaim, blocking all waiters regardless of their priority or
> > criticality.
> >
> > >
> > > This is a real problem we encountered, especially with the jbd handler
> > > resources of the ext4 file system. Our current attempt is to defer
> > > memory reclamation until returning to user space, in order to solve
> > > various priority inversion issues caused by the jbd handler. Therefore,
> > > I would be interested to discuss this topic.
> > >
> >
> > Awesome, do you use both memory.max and memory.high, and defer the reclaim
> > for both? Are you deferring all reclaims or just the ones where the charging
> > process holds a lock? (I need to look up what a jbd handler is.)
> >
> We do not use memory.high: although it supports deferring memory reclamation
> to user space, it also attempts to throttle the memory allocation rate, which
> introduces significant latency. For our application, we would rather accept
> an OOM under such circumstances. We previously attempted to address the
> priority inversion caused by the jbd handler separately (which we encounter
> frequently since we use the ext4 file system); see [1]. Of course, that
> solution lacks generality, as it requires calling new interfaces for each
> lock resource. We therefore have a more aggressive idea internally: defer
> all reclamation triggered by kernel-space memory allocation until just
> before returning to user space. This should resolve the vast majority of
> priority inversion problems. The only potential issue it introduces is that
> kernel-space memory usage may briefly exceed memory.max.
>
> [1] https://lore.kernel.org/linux-mm/cover.1750234270.git.hezhongkun.hzk@xxxxxxxxxxxxx/#r
>
> Muchun,
> Thanks.
>
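
To make the deferred-reclaim idea quoted above concrete, here is a toy
user-space model (in Python, purely illustrative; none of these names are
kernel interfaces) of charging without synchronous reclaim and repaying
the reclaim debt at the return-to-user-space boundary:

```python
# Toy model of the defer-until-return design: a charge that pushes
# usage over the limit does NOT reclaim synchronously (the task may be
# holding locks); it records a reclaim debt that is repaid at the
# return-to-user-space boundary, a safe point where no kernel locks are
# held. The Memcg class and its methods are hypothetical, for
# illustration only.

class Memcg:
    def __init__(self, limit: int):
        self.limit = limit         # analogous to memory.max
        self.usage = 0
        self.pending_reclaim = 0   # debt deferred to return-to-user

    def charge(self, nbytes: int) -> None:
        # Always succeeds: usage may briefly exceed the limit, the
        # trade-off the design explicitly accepts.
        self.usage += nbytes
        overage = self.usage - self.limit
        if overage > 0:
            self.pending_reclaim = overage

    def return_to_user(self) -> int:
        # Safe point: reclaim here cannot block lock waiters, so it
        # avoids the jbd-handler style priority inversion.
        reclaimed = min(self.pending_reclaim, self.usage)
        self.usage -= reclaimed
        self.pending_reclaim = 0
        return reclaimed
```

Note how charge() never blocks, so a lock-holding task is never throttled
mid-transaction; the cost is the brief overshoot of the limit described in
the email above.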