Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode

From: Peter Zijlstra
Date: Tue Feb 14 2017 - 05:39:01 EST

On Sun, Feb 12, 2017 at 02:05:44PM +0900, Tejun Heo wrote:
> Hello,
> On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> > Sure, we're past that. This isn't about what memcg can or cannot do.
> > Previous discussions established that controllers come in two shapes:
> >
> > - task based controllers; these are build on per task properties and
> > groups are aggregates over sets of tasks. Since per definition inter
> > task competition is already defined on individual tasks, its fairly
> > trivial to extend the same rules to sets of tasks etc..
> >
> > Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
> >
> > - system controllers; instead of building from tasks upwards, they
> > split what previously would be machine wide / global state. For these
> > there is no natural competition rule vs tasks, and hence your
> > no-internal-task rule.
> >
> > Examples: memcg, io, hugetlb
> This is a bit of delta but as I wrote before, at least cpu (and
> accordingly cpuacct) won't stay purely task-based as we should account
> for resource consumptions which aren't tied to specific tasks to the
> matching domain (e.g. CPU consumption during writeback, disk
> encryption or CPU cycles spent to receive packets).

We should probably do that in another thread, but I'd probably insist on
separate controllers that co-operate to get that done.

> > > And here's another point, currently, all controllers are enabled
> > > consecutively from root. If we have leaf thread subtrees, this still
> > > works fine. Resource domain controllers won't be enabled into thread
> > > subtrees. If we allow switching back and forth, what do we do in the
> > > middle while we're in the thread part?
> >
> > From what I understand you cannot re-enable a controller once its been
> > disabled, right? If you disable it, its dead for the entire subtree.
> cgroups on creation don't enable controllers by default and users can
> enable and disable controllers dynamically as long as the conditions
> are met. So, they can be disable and re-enabled.

I was talking in a hierarchical sense, your section 2-4-2. Top-Down
constraint seems to state similar things to what I meant.

Once you disable a controller it cannot be re-enabled in a subtree.

> > > No matter what we do, it's
> > > gonna be more confusing and we lose basic invariants like "parent
> > > always has superset of control knobs that its child has".
> >
> > No, exactly that. I don't think I ever proposed something different.
> >
> > The "resource domain" flag I proposed violates the no-internal-processes
> > thing, but it doesn't violate that rule afaict.
> If we go to thread mode and back to domain mode, the control knobs for
> domain controllers don't make sense on the thread part of the tree and
> they won't have cgroup_subsys_state to correspond to either. For
> example,
> A - T - B
> B's memcg knobs would control memory distribution from A and cgroups
> in T can't have memcg knobs. It'd be weird to indicate that memcg is
> enabled in those cgroups too.

But memcg _is_ enabled for T. All the tasks are mapped onto A for
purpose of the system controller (memcg) and are subject to its

> We can make it work somehow. It's just weird-ass interface.

You could make these control files (read-only?) symlinks back to A's
actual control files. To more explicitly show this.

> > > As for the runtime overhead, if you get affected by adding a top-level
> > > cgroup in any measureable way, we need to fix that. That's not a
> > > valid argument for messing up the interface.
> >
> > I think cgroup tree depth is a more significant issue; because of
> > hierarchy we often do tree walks (uo-to-root or down-to-task).
> >
> > So creating elaborate trees is something I try not to do.
> So, as long as the depth stays reasonable (single digit or lower),
> what we try to do is keeping tree traversal operations aggregated or
> located on slow paths.

While at the same time you allowed that BPF cgroup thing to not be
hierarchical because iterating the tree was too expensive; or did I

Also, I think Mike showed you the pain and hurt are quite visible for
even a few levels.

Batching is tricky, you need to somehow bound the error function in
order to not become too big a factor on behaviour. Esp. for cpu, cpuacct
obviously doesn't care much as it doesn't enforce anything.

> In general, I think it's important to ensure that this in general is
> the case so that users can use the logical layouts matching the actual
> resource hierarchy rather than having to twist the layout for
> optimization.

One does what one can.. But it is important to understand the
constraints, nothing comes for free.

> > > Even if we allow switching back and forth, we can't make the same
> > > cgroup both resource domain && thread root. Not in a sane way at
> > > least.
> >
> > The back and forth thing yes, but even with a single level, the one
> > resource domain you tag will be both resource domain and thread root.
> Ah, you're right.

Also, there is the one giant wart in v2 wrt no-internal-processes;
namely the root group is allowed to have them.

Now I understand why this is so; so don't feel compelled to explain that
again, but it does make the model very ugly and has a real problem, see
below. OTOH, since it is there, I would very much like to make use of
this 'feature' and allow a thread-group on the root group.

And since you then _can_ have nested thread groups, it again becomes
very important to be able to find the resource domains, which brings me
back to my proposal; albeit with an addition constraint.

Now on to the problem of the no-internal-processes wart; how does
cgroup-v2 currently implement the whole container invariant? Because by
that invariant, a container's 'root' group must also allow