Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)

From: Jiayuan Chen

Date: Wed Mar 11 2026 - 00:58:35 EST

On 3/8/26 2:24 AM, Shakeel Butt wrote:

Over the last couple of weeks, I have been brainstorming on how I would go
about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
focus on existing challenges and issues. This proposal outlines the high-level
direction. Follow-up emails and patch series will cover and brainstorm the
mechanisms (of course BPF) to achieve these goals.

Memory cgroups provide memory accounting and the ability to control memory usage
of workloads through two categories of limits. Throttling limits (memory.max and
memory.high) cap memory consumption. Protection limits (memory.min and
memory.low) shield a workload's memory from reclaim under external memory
pressure.

Challenges
----------

- Workload owners rarely know their actual memory requirements, leading to
overprovisioned limits, lower utilization, and higher infrastructure costs.

- Throttling limit enforcement is synchronous in the allocating task's context,
which can stall latency-sensitive threads.

- The stalled thread may hold shared locks, causing priority inversion -- all
waiters are blocked regardless of their priority.

- Enforcement is indiscriminate -- there is no way to distinguish a
performance-critical or latency-critical allocator from a latency-tolerant
one.

- Protection limits assume a static working set size, forcing owners to either
overprovision or build complex userspace infrastructure to dynamically adjust
them.

Feature Wishlist
----------------

Here is the list of features and capabilities I want to enable in the
redesigned memcg limit enforcement world.

Per-Memcg Background Reclaim

In the new memcg world, with the goal of (mostly) eliminating direct synchronous
reclaim for limit enforcement, provide per-memcg background reclaimers which can
scale across CPUs with the allocation rate.

This sounds like a very useful approach. I have a few questions I'm
thinking through:

How would you approach implementing this background reclaim? I'm imagining
something like asynchronous memory.reclaim operations - is that in line
with your thinking?
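
To make the question concrete, here is a rough userspace sketch of the kind
of thing I have in mind. It is only an illustration under my own
assumptions, not the mechanism from the proposal: a helper polls
memory.current, and once usage crosses a watermark below memory.high it
writes the excess to memory.reclaim, so reclaim runs in the helper's
context rather than in the allocating task. The function names and the
headroom_pct parameter are hypothetical; memory.current, memory.high and
memory.reclaim are the existing cgroup v2 files.

```python
import os


def read_bytes(path):
    """Read a cgroup v2 value file; the literal 'max' means no limit."""
    with open(path) as f:
        text = f.read().strip()
    return None if text == "max" else int(text)


def reclaim_target(current, high, headroom_pct=10):
    """Bytes to reclaim so usage falls to (100 - headroom_pct)% of high."""
    if high is None:
        return 0  # no memory.high configured, nothing to enforce
    watermark = high * (100 - headroom_pct) // 100
    return max(0, current - watermark)


def reclaim_once(cgroup_dir, headroom_pct=10):
    """One polling step: request reclaim if usage is above the watermark."""
    current = read_bytes(os.path.join(cgroup_dir, "memory.current"))
    high = read_bytes(os.path.join(cgroup_dir, "memory.high"))
    target = reclaim_target(current, high, headroom_pct)
    if target > 0:
        try:
            with open(os.path.join(cgroup_dir, "memory.reclaim"), "w") as f:
                f.write(str(target))
        except OSError:
            # The write can fail (e.g. EAGAIN) when the kernel reclaims
            # less than requested; a real helper would back off and retry.
            pass
    return target
```

A real helper would call reclaim_once() in a loop with a sleep interval. The
obvious limitation is that a single userspace poller does not scale across
CPUs with the allocation rate, which is why I am asking whether the
in-kernel design is meant to look like this or like something quite
different.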

And regarding cold page identification - do you have a preferred approach?
I'm curious what the most practical way would be to accurately identify
which pages to reclaim.

Would be great to hear your perspective.