Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)

From: Shakeel Butt

Date: Wed Mar 11 2026 - 17:36:38 EST


Hi Greg,

On Wed, Mar 11, 2026 at 12:29:45AM -0700, Greg Thelen wrote:
> On Sat, Mar 7, 2026 at 10:24 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >
> >
>
> Very interesting set of topics. A few more come to mind.

Thanks.

>
> I've wondered about preallocating memory or guaranteeing access to
> physical memory for a job. Memcg has max limits and min protections,
> but no preallocation (i.e. no conceptual memcg free list). So if a job
> is configured with 1GB min workingset protection that only ensures 1GB
> won't be reclaimed, not that 1GB can be allocated in a reasonable
> amount of time. This isn't just a job startup problem: if a page is
> freed with MADV_DONTNEED a subsequent pgfault may require a lot of
> time to handle, even if usage is below min.

This is indeed correct, i.e. protection limits protect the workload from
external reclaim but do not provide any guarantee that memory can be allocated
cheaply and in a reasonable time (without triggering reclaim/compaction). This
is one of the challenges in implementing a userspace oom-killer in an
aggressively overcommitted environment.

However to me providing memory allocation guarantees is more of a system level
feature and orthogonal to memcg. And I see your next para is about that :)

Anyway, I think if we keep system memory utilization below some value and
guarantee there is always some free memory, then we might not need a memcg free
list or similar mechanisms (most of the time, I think). This can be done either
by having a common ancestor of all workloads with a limit set on it, or by
having the node controller maintain the condition that the sum of the limits of
all top-level cgroups stays below some percentage of total memory.
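
To make the node controller option concrete, here is a rough Python sketch of
the check such a controller could run. It is only an illustration, not an
existing tool: the helper names and the 90% threshold are made up, and it
assumes cgroup v2 mounted at /sys/fs/cgroup, where memory.max holds either
"max" or a byte count.

```python
from pathlib import Path


def parse_limit(text, total_bytes):
    """Parse the contents of a memory.max file.

    The file holds either the literal "max" (no limit) or a byte count.
    An unlimited cgroup is counted as being able to use all of memory.
    """
    text = text.strip()
    return total_bytes if text == "max" else int(text)


def headroom_ok(limits, total_bytes, max_fraction=0.9):
    """True if the sum of the limits stays within max_fraction of memory.

    max_fraction is a hypothetical policy knob, not a kernel interface.
    """
    return sum(limits) <= max_fraction * total_bytes


def read_top_level_limits(total_bytes, root="/sys/fs/cgroup"):
    """Collect memory.max for every top-level cgroup under root."""
    return [
        parse_limit(path.read_text(), total_bytes)
        for path in sorted(Path(root).glob("*/memory.max"))
    ]
```

A controller would call read_top_level_limits() with the machine's total
memory and refuse to admit (or resize) a job when headroom_ok() turns false.
Note that a single top-level cgroup left at "max" breaks the condition for any
max_fraction below 1, which matches the intent here.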

>
> Initial allocation policies are controlled by mempolicy/cpuset. Should
> we continue to keep allocation policies and resource accounting
> separate? It's a little strange that memcg can (1) cap max usage of
> tier X memory, and (2) provide minimum protection for tier X usage,
> but has no influence on where memory is initially allocated?

I think I understand your point, but the implementation would be too messy. This
is orthogonal to the proposal, but I would say it is a good topic for LSFMMBPF
if you want to lead the discussion.