Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware

From: Michal Hocko

Date: Tue Feb 24 2026 - 06:32:38 EST


On Mon 23-02-26 14:38:23, Joshua Hahn wrote:
> Memory cgroups provide an interface that allows multiple workloads on a
> host to co-exist, and establishes both weak and strong memory isolation
> guarantees. For large servers and small embedded systems alike, memcgs
> offer an effective way to provide a baseline quality of service for
> protected workloads.
>
> This works because, for the most part, all memory is equal (except for
> zram / zswap). Restricting a cgroup's memory footprint restricts how
> much it can hurt other workloads competing for memory. Likewise, setting
> memory.low or memory.min limits can provide weak and strong guarantees
> for the performance of a cgroup.
>
> However, on systems with tiered memory (e.g. CXL / compressed memory),
> the quality-of-service guarantees that memcg limits enforce become less
> effective, as memcg has no awareness of the physical location of its
> charged memory. In other words, a workload that is well-behaved within
> its memcg limits may still hurt the performance of other well-behaved
> workloads on the system by hogging more than its "fair share" of
> toptier memory.

This assumes that the active working set of all workloads doesn't
fit into the top tier, right? Otherwise promotions would make sure that
we have the most active memory in the top tier. Is this typical in
real-life configurations?

Or do you intend to limit memory consumption on a particular tier even
without any external pressure?

> Introduce tier-aware memcg limits, which scale memory.low/high to
> reflect the ratio of toptier:total memory the cgroup has access to.
>
> Take the following scenario as an example:
> On a host with a 3:1 toptier:lowtier split, say 150G toptier and 50G lowtier,
> setting a cgroup's limits to:
> memory.min: 15G
> memory.low: 20G
> memory.high: 40G
> memory.max: 50G
>
> Will be enforced at the toptier as:
> memory.min: 15G
> memory.toptier_low: 15G (20 * 150/200)
> memory.toptier_high: 30G (40 * 150/200)
> memory.max: 50G

Let's spend some more time on the interface first. You seem to be
focusing only on the top tier with this interface, right? Is this really
the right way to go long term? What makes you believe that we will not
hit the same issue with other tiers as well? Also, do we want/need to
duplicate all the limits for each tier, or only the top tier? What is
the reasoning for the switch being a runtime sysctl rather than a
boot-time or cgroup mount option?

I will likely have more questions, but these are the immediate ones
after reading the cover letter. Please note I haven't really looked at
the implementation yet. I really want to understand the usecases and
interface first.
--
Michal Hocko
SUSE Labs