Re: [PATCHSET for-4.13] cgroup: implement cgroup2 thread mode, v2

From: Peter Zijlstra
Date: Mon Jul 10 2017 - 04:32:24 EST

On Fri, Jun 30, 2017 at 09:23:24AM -0400, Tejun Heo wrote:
> On Tue, Jun 27, 2017 at 09:01:43AM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 12, 2017 at 05:27:53PM -0400, Tejun Heo wrote:

> > IIRC the problem with the 'threaded' marker is that it doesn't clearly
> > capture what a resource domain is.
> >
> > That is, assuming that a thread root is always a resource domain, we get
> > the following problem:
> >
> > If we set 'threaded' on the root group in order to create a thread
> > (sub)group. If we then want to create another domain group, we'd have to
> > clear 'threaded' on that.
> >
> > R (t=1)
> > / \
> > (t=1) T D (t=0)
> >
> > So far so good. However, now we want to create another thread group
> > under our domain group D, so we have to set its 'threaded' marker again:
> >
> > R (t=1)
> > / \
> > (t=1) T D (t=1)
> > /
> > T (t=1)
> >
> > And we can no longer identify D as a resource domain. If OTOH we mark
> > 'domain' we get:
> >
> > R (d=1)
> > / \
> > (d=0) T D (d=1)
> > /
> > T (d=0)
> >
> > Which clearly identifies the domains and the thread only groups.
> So, the difference between the two interfaces is that the one I
> proposed is marking the thread root which makes all its descendants
> threaded while the above is marking each individual cgroup as being
> whether a resource domain or threaded.

You start by marking the thread root, but then continue to mark all
'threaded' (including root). This then leads to the problem described
above where you cannot (easily) (re)discover what the actual root is.

My proposal differs in that we retain a clear difference between
resource domain / root and threaded (sub)trees.

> > Your objections to doing this were representing the resource controllers
> > in the intermediate thread-only groups like:
> >
> > R
> > \
> > T -- what to do with eg. memcg here?
> > \
> > D
> > \
> > T
> And that's a perfectly valid point and as you pointed out the downside
> of marking each node separately is that the interface would allow
> configurations which aren't supported (at least for now) and that
> there's just more to configure - the user has to set the mode on each
> node after creation which is just the natural cost of being able to
> express more.

I'm not sure my proposal results in _more_ configuration per-se. Yes
there are some differences, but they go both ways, with the threaded tag
its easier to create multiple threaded subgroups, but with the domain
tag its easier to create multiple domain subgroups.

And I think the bias for the domain tag -- easier to create more domains
-- is the right one. The whole threaded thing is fairly special purpose.

In any case; I'm fine with initially not supporting domains nested under
thread groups -- although I do think there's valid use-cases for doing

> > I suggested having all resource controllers represented with a soft-link
> > back into the (thread-root) resource domain. But you were not convinced
> > and worried people were going to be confused.
> There's more to it than just confusion because resource interface
> files belong to the parent cgroup rather than the cgroup which hosts
> the files. This becomes clear when thinking about which files a
> container should be granted write access to when delegating a cgroup
> subtree to it. I'm sure we can get around it some way but we need to
> be careful here.

I would suggest having all the back-links be RO. That way you can never
grant write permission, can never actually change things that don't make
'sense', but get a really good clue where you need to go.

> > By having both in nice units, its both conceptually clear that they're
> > the same kind of weight and easier to match weights.

> But, it doesn't have to be this or that. We can easily support both
> units by simply allowing, say, "-5n" to be written to cpu.weight file
> and interpret that as the nice value and exposing the closest nice
> value in the cpu.stat file.
> Does that sound workable?

Not sure; reading the value would become somewhat awkward I suppose. But
maybe we can simply do two files, whichever is written to last takes
precedence. And when a !nice weight is written, the nice file returns
-EINVAL or something.

Not particularly pretty though...