Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection

From: Johannes Weiner
Date: Tue Feb 18 2020 - 14:52:58 EST

On Mon, Feb 17, 2020 at 09:41:00AM +0100, Michal Hocko wrote:
> On Fri 14-02-20 11:53:11, Johannes Weiner wrote:
> [...]
> > The proper solution to implement the kind of resource hierarchy you
> > want to express in cgroup2 is to reflect it in the cgroup tree. Yes,
> > the_workload might have been started by user 100 in session c2, but in
> > terms of resources, it's prioritized over system.slice and user.slice,
> > and so that's the level where it needs to sit:
> >
> >                      root
> >             /         |          \
> >    system.slice   user.slice   the_workload
> >      /    |           |
> >   cron  journal user-100.slice
> >                       |
> >               session-c2.scope
> >                       |
> >                     misc
> >
> > Then you can configure not just memory.low, but also a proper io
> > weight and a cpu weight. And the tree correctly reflects where the
> > workload is in the pecking order of who gets access to resources.
> I have already mentioned that this would be the only solution when the
> protection would work, right. But I am also saying that this is a trivial
> example where you simply _can_ move your workload to the 1st level. What
> about those that need to reflect organization into the hierarchy? Please
> have a look at
> Are you saying they are just not supported? Are they supposed to use
> cgroup v1 for the organization and v2 for the resource control?

From that email:

> Let me give you an example. Say you have a DB workload which is the
> primary thing running on your system and which you want to protect from
> an unrelated activity (backups, frontends, etc). Running it inside a
> cgroup with memory.low while other components in other cgroups without
> any protection achieves that. If those cgroups are top level then this
> is simple and straightforward configuration.
> Things would get much more tricky if you want to run the same workload
> deeper down the hierarchy - e.g. run it in a container. Now your
> "root" has to use an explicit low protection as well and all other
> potential cgroups that are in the same sub-hierarchy (read in the same
> container) need to opt-out from the protection because they are not
> meant to be protected.

You can't prioritize some parts of a cgroup higher than the outside of
the cgroup, and other parts lower than the outside. That's just not
something that can be sanely supported from the controller interface.

However, that doesn't mean this usecase isn't supported. You *can*
always split cgroups for separate resource policies.

And you *can* split cgroups for group labeling purposes too (tracking
stuff that belongs to a certain user).

So when you have an important database and a not-so-important
secondary workload, and you want to run them containerized, there
are two possible scenarios:

- The workloads are co-dependent (e.g. a logging service for the
  db). In that case you actually need to protect them equally,
  otherwise you'll have priority inversions, where the primary gets
  backed up behind the secondary in some form or another.

- The workloads don't interact with each other. In that case, you can
  create two separate containers, one high-pri, one low-pri, and run
  them in parallel. They can share filesystem data, page cache
  etc. where appropriate, so this isn't a problem.

The fact that they belong to the same team, organization, or "user"
is an attribute that can be tracked from userspace and isn't
material from a kernel interface POV.

You just have two cgroups instead of one to track; but those cgroups
will still contain stuff like setsid(), setuid() etc. so users
cannot escape whatever policy/containment you implement for them.
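
To make the second scenario concrete, here is a rough sketch (not part
of the patch series) of splitting the two workloads into sibling
cgroups with different memory.low settings. The group names "db" and
"batch", the 8G figure, and the helpers make_group() and
split_by_policy() are all made up for illustration. It writes into a
scratch directory so it runs unprivileged; on a real system the base
would be the cgroup2 mount (commonly /sys/fs/cgroup), with the memory
controller enabled in the parent's cgroup.subtree_control.

```python
import tempfile
from pathlib import Path


def make_group(base: Path, name: str, memory_low: str) -> Path:
    """Create a cgroup directory under base and set its memory.low."""
    group = base / name
    group.mkdir(parents=True, exist_ok=True)
    # memory.low takes a byte count or "max"; "0" means no protection.
    (group / "memory.low").write_text(memory_low + "\n")
    return group


def split_by_policy(base: Path) -> dict:
    """Two sibling groups: a protected database, an unprotected secondary."""
    return {
        # Hypothetical 8G of protection for the primary workload.
        "db": make_group(base, "db", str(8 * 1024**3)),
        # The secondary workload gets no protection at all.
        "batch": make_group(base, "batch", "0"),
    }


if __name__ == "__main__":
    # The scratch directory stands in for the cgroup2 mount point.
    with tempfile.TemporaryDirectory() as scratch:
        groups = split_by_policy(Path(scratch))
        for name, path in groups.items():
            print(name, (path / "memory.low").read_text().strip())
```

Which team or user owns both groups is not encoded anywhere in this
layout; under this sketch that mapping would live in whatever
userspace tooling creates the groups.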

> In short we simply have to live with usecases where the cgroup hierarchy
> follows the "logical" workload organization at the higher level more
> than resource control. This is the case for systemd as well btw.
> Workloads are organized into slices and scopes without any direct
> relation to resources in mind.

As I said in the previous email: Yes, by default, because systemd
starts everything in a single resource domain. But it has all the
necessary support for dividing the tree into disjoint resource domains.

> Does this make it more clear what I am thinking about? Does it sound
> like a legit usecase?

The desired behavior is legit, but you have to split the cgroups on
conflicting attributes - whether organizational or policy-related -
for properly expressing what you want from the kernel.