Re: memcg reclaim demotion wrt. isolation

From: Johannes Weiner
Date: Wed Dec 14 2022 - 12:40:30 EST


Hey Michal,

On Wed, Dec 14, 2022 at 04:29:06PM +0100, Michal Hocko wrote:
> On Wed 14-12-22 13:40:33, Johannes Weiner wrote:
> > The only way to prevent cgroups from disrupting each other on NUMA
> > nodes is NUMA constraints. Cgroup per-node limits. That shields not
> > only from demotion, but also from DoS-mbinding, or aggressive
> > promotion. All of these can result in some form of premature
> > reclaim/demotion, proactive demotion isn't special in that way.
>
> Any numa based balancing is a real challenge with memcg semantic. I do
> not see per numa node memcg limits without a major overhaul of how we do
> charging though. I am not sure this is on the table even long term.
> Unless I am really missing something here we have to live with the
> existing semantic for a foreseeable future.

Yes, I think you're quite right.

We've been mostly skirting the NUMA issue in cgroups (and to a degree
in MM code in general) with two possible answers:

a) The NUMA distances are small enough that we ignore them and
pretend all memory is (mostly) fungible.

b) The NUMA distances are big enough that it matters, in which case
the best option is to avoid sharing, and use bindings to keep
workloads/containers isolated to their own CPU+memory domains.
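
For b), a minimal userspace sketch of such a binding via the cgroup2
cpuset controller (the cgroup path, CPU range and node number are
made-up examples):

#include <stdio.h>
#include <stdlib.h>

/* write a value to a cgroup2 interface file, bail on error */
static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF || fclose(f)) {
                perror(path);
                exit(1);
        }
}

int main(void)
{
        /* let children of the root cgroup use the cpuset controller */
        write_str("/sys/fs/cgroup/cgroup.subtree_control", "+cpuset");

        /* confine the (hypothetical) workload-a to socket 0's CPUs and memory */
        write_str("/sys/fs/cgroup/workload-a/cpuset.cpus", "0-15");
        write_str("/sys/fs/cgroup/workload-a/cpuset.mems", "0");
        return 0;
}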

Tiered memory forces the issue by providing memory that must be shared
between workloads/containers, but is not fungible. At least not
without incurring priority inversions between containers, where a
lopri container promotes its own pages to the top tier and demotes
the hipri workload's pages, while staying happily within its global
memory allowance.

This applies to mbind() cases as much as it does to NUMA balancing.

If these setups proliferate, it seems inevitable to me that sooner or
later the full problem space of memory cgroups - dividing up a shared
resource while allowing overcommit - will apply not just to "RAM as a
whole", but to each memory tier individually.

Whether we need the full memcg interface per tier or per node, I'm not
sure. It might be enough to automatically apportion global allowances
to nodes; so if you have 32G toptier and 16G lowtier, and a cgroup has
a 20G allowance, it gets 13G on top and 7G on low.
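
Not kernel code, but roughly the arithmetic I mean, as a userspace
sketch (tier sizes, the allowance and the rounding rule are
illustrative):

#include <stdio.h>

int main(void)
{
        unsigned long toptier = 32, lowtier = 16;   /* tier sizes in G */
        unsigned long limit = 20;                   /* cgroup allowance in G */
        unsigned long total = toptier + lowtier;

        /* split the allowance in proportion to each tier's share */
        unsigned long top_quota = limit * toptier / total;  /* 20*32/48 = 13 */
        unsigned long low_quota = limit - top_quota;        /* remainder  =  7 */

        printf("top: %luG, low: %luG\n", top_quota, low_quota);
        return 0;
}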

(That, or we settle on multi-socket systems with private tiers, such
that memory continues to be unshared :-)

Either way, I expect this issue will keep coming up as we try to use
containers on such systems.