Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

From: Paul Turner
Date: Mon Aug 24 2015 - 16:52:38 EST

On Sat, Aug 22, 2015 at 11:29 AM, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, Paul.
> On Fri, Aug 21, 2015 at 12:26:30PM -0700, Paul Turner wrote:
> ...
>> A very concrete example of the above is a virtual machine in which you
>> want to guarantee scheduling for the vCPU threads which must schedule
>> beside many hypervisor support threads. A hierarchy is the only way
>> to fix the ratio at which these compete.
> Just to learn more, what sort of hypervisor support threads are we
> talking about? They would have to consume considerable amount of cpu
> cycles for problems like this to be relevant and be dynamic in numbers
> in a way which letting them compete against vcpus makes sense. Do
> IO helpers meet these criteria?

I'm not sure what you mean by an IO helper. By support threads I mean
any threads that are used in the hypervisor implementation that are
not hosting a vCPU.
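
To make the ratio argument concrete, here is a minimal arithmetic sketch.
It assumes an idealized single CPU under full contention where every
thread has equal weight; the function names and the share values used
below are illustrative only, not taken from any real configuration.

```c
/* Idealized CPU fraction received by the vCPU threads, single CPU,
 * everything runnable. Names and numbers are hypothetical. */

/* Flat: every thread competes equally, so the vCPU fraction shrinks
 * as the number of support threads grows. */
double flat_vcpu_fraction(int n_vcpu, int n_support)
{
    return (double)n_vcpu / (double)(n_vcpu + n_support);
}

/* Hierarchical: vCPU threads and support threads sit in two sibling
 * groups, so the fraction depends only on the two group weights and
 * not on how many threads each group holds. */
double grouped_vcpu_fraction(unsigned vcpu_shares, unsigned support_shares)
{
    return (double)vcpu_shares / (double)(vcpu_shares + support_shares);
}
```

With 4 vCPU threads and 12 support threads the flat case gives the vCPUs
0.25 of the CPU; with sibling groups weighted 3072 and 1024 they get
0.75 however many support threads come and go.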

>> An example that's not the cpu controller is that we use cpusets to
>> expose to applications their "shared" and "private" cores. (These
>> sets are dynamic based on what is coscheduled on a given machine.)
> Can you please go into more details with these?

We typically share our machines between many jobs; these jobs can have
cores that are "private" (and not shared with other jobs) and cores
that are "shared" (general purpose cores accessible to all jobs on the
same machine).

The pool of cpus in the "shared" pool is dynamic as jobs entering and
leaving the machine take or release their associated "private" cores.

By creating the appropriate sub-containers within the cpuset group we
allow jobs to pin specific threads to run on their (typically) private
cores. This also allows the management daemons additional flexibility
as it's possible to update which cores we place as private, without
synchronization with the application. Note that sched_setaffinity()
is a non-starter here.
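
As a rough sketch of what pinning looks like from the job's side with
the v1 cpuset controller: a thread is attached by writing its TID to the
"tasks" file of a sub-container. The mount point, the job directory and
the "private" sub-container name are assumptions for illustration; only
the "tasks" file itself is the existing v1 interface.

```c
#include <stdio.h>
#include <sys/types.h>

/* Build "<job_dir>/<sub>/tasks" into buf.
 * Returns 0 on success, -1 if the result would not fit. */
int cpuset_tasks_path(char *buf, size_t len,
                      const char *job_dir, const char *sub)
{
    int n = snprintf(buf, len, "%s/%s/tasks", job_dir, sub);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Attach one thread to a sub-container by writing its TID to the
 * v1 "tasks" file. Returns 0 on success, -1 on any failure. */
int attach_tid(const char *tasks_path, pid_t tid)
{
    FILE *f = fopen(tasks_path, "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%ld\n", (long)tid) > 0;
    /* The write is buffered, so a kernel-side rejection shows up here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

The thread names only the sub-container. Which cores sit behind it is
held in that container's cpuset.cpus, which the management daemon can
rewrite at any time; a mask passed to sched_setaffinity() would instead
be fixed by the application.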

>> > Why would you assume that threads of a process wouldn't want to
>> > configure it ever? How is this different from CPU affinity?
>> In general cache and CPU behave differently. Generally for it to make
>> sense between threads in a process they would have to have wholly
>> disjoint memory, at which point the only sane long-term implementation
>> is separate processes and the management moves up a level anyway.
>> That said, there are surely cases in which it might be convenient to
>> use at a per-thread level to correct a specific performance anomaly.
>> But at that point, you have certainly reached the level of hammer that
>> you can coordinate with an external daemon if necessary.
> So, I'm not super familiar with all the use cases but the whole cache
> allocation thing is almost by nature a specific niche thing and I feel
> pretty reluctant to blow off per-thread usages as too niche to worry
> about.

Let me try to restate:
I think that we can say the usage is sufficiently niche that it
will *typically* be used by higher level management daemons which
prefer a more technical and specific interface. This does not
preclude use by threads, it just makes it less convenient; I think
that we should be optimizing for flexibility over ease-of-use for a
very small number of cases here.

>> > I don't follow what you're trying to say with the above paragraph.
>> > Are you still talking about CAT? If so, that use case isn't the only
>> > one. I'm pretty sure there are people who would want to configure
>> > cache allocation at thread level.
>> I'm not agreeing with you that "in cgroups" means "must be usable by
>> applications within that hierarchy". A cgroup subsystem used as a
>> partitioning API only by system management daemons is entirely
>> reasonable. CAT is a reasonable example of this.
> I see. The same argument. I don't think CAT just being system
> management thing makes sense.
>> > So, this is a trade-off we're consciously making. If there are
>> > common-enough use cases which require jumping across different cgroup
>> > domains, we'll try to figure out a way to accommodate those but by
>> > default migration is a very cold and expensive path.
>> The core here was the need for allowing sub-process migration. I'm
>> not sure I follow the performance trade-off argument; haven't we
>> historically seen the opposite? That migration has been a slow-path
>> without optimizations and people pushing to make it faster? This
>> seems a hard generalization to make for something that's inherently
>> tied to a particular controller.
> It isn't something tied to a particular controller. Some controllers
> may get impacted less than others but there's an inherent
> connection between how dynamic an association is and how expensive the
> locking around it needs to be and we need to set up basic behavior and
> usage conventions so that different controllers are designed and
> implemented assuming similar usage patterns; otherwise, we end up with
> the chaotic shit show that we have had where everything behaves
> differently and nobody knows what's the right way to do things and we
> end up locked into weird requirements which some controller induced
> for no good reason but cause significant pain on use cases which
> actually matter.
>> I don't care if we try turning that dial back to assume it's a cold
>> path once more, only that it's supported.
> It has always been a cold path and I'm not saying this is gonna be
> noticeably worse in the future but usages like bouncing threads on
> request-by-request basis are and will be clearly worse than bouncing
> to threads which are already in the target domain.
>> >> > A forwarding /proc/thread_self/cgroup accessor, or similar, would be another
>> >> > way to address some of these issues.
>> >
>> > That sounds horrible to me. What if the process wants to do RMW a
>> > config?
>> Locking within a process is easy.
> It's not contained in the process at all. What if an external entity
> decides to migrate the process into another cgroup in between?

If we have 'atomic' moves and a way to access our sub-containers from
the process in a consistent fashion (e.g. relative paths) then this is
not an issue.

>> > What if the permissions are different after an intervening
>> > migration?
>> This is a side-effect of migration not being properly supported.
>> > What if the sub-hierarchy no longer exists or has been
>> > replaced by a hierarchy with the same topology but actually is a
>> > different one?
>> The easy answer is that: Only a process should be managing its
>> sub-hierarchy. That's the nice thing about hierarchies.
> cgroupfs is a horrible place to implement that part of interface. It
> doesn't make any sense to combine those two into the same hierarchy.
> You're agreeing to the identified problem but somehow still suggesting
> doing what we've been doing when the root cause of the said problem is
> conflating and interlocking these two separate things.

I am not endorsing the world we are in today, only describing how it
can be somewhat sanely managed. Some of these lessons could be
formalized in imagining the world of tomorrow. E.g. the sub-process
mounts could appear within some (non-movable) alternate file-system.

>> The harder answer is: How do we handle non-fungible resources such as
>> CPU assignments within a hierarchy? This is a big part of why I make
>> arguments for certain partitions being management-software only above.
>> This is imperfect, but better than where we stand today.
> I'm not following. Why is that different?

It matters any time a change made from outside the application to its
parent cgroup requires matching changes in the sub-hierarchy. This is
most visible with a uniquely identified resource such as a cpu, but it
applies similarly to any limit.

>> > Let's build an API which actually looks and behaves like an API which
>> > is properly isolated from what external agents may do to the process.
>> > I can't see how that would be "back to where we are today". All of
>> > those are pretty critical attributes for a public kernel API and
>> > utterly broken right now.
>> Sure, but I don't think you can throw out per-thread control for all
>> controllers to enable this. Which makes everything else harder. An
>> intermediary step in unification might be that we move from N mounts
>> to 2. Those that can be managed at the process level, and those that
>> can't. It's a compromise, but may allow cleaner abstractions for the
>> former case.
> The transition can already be gradual. Why would you add yet another
> transition step?

Because what's being proposed today does not offer any replacement for
the sub-process control that we depend on today? Why would we embark
on merging the new interface before these details are sufficiently

> Thanks.
> --
> tejun