Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

From: Kamezawa Hiroyuki
Date: Mon Aug 24 2015 - 22:37:33 EST

On 2015/08/25 8:15, Paul Turner wrote:
On Mon, Aug 24, 2015 at 3:49 PM, Tejun Heo <tj@xxxxxxxxxx> wrote:

On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote:
Hmm... I was hoping for actual configurations and usage scenarios.
Preferably something people can set up and play with.

This is much easier to set up and play with synthetically. Just
create the 10 threads and 100 threads above then experiment with
configurations designed at guaranteeing the set of 100 threads
relatively uniform throughput regardless of how many are active. I
don't think trying to run a VM stack adds anything except complexity
of reproduction here.

Well, but that loses most of details and why such use cases matter to
begin with. We can imagine up stuff to induce arbitrary set of

All that's being proved or disproved here is that it's difficult to
coordinate the consumption of asymmetric thread pools using nice. The
constraints here were drawn from a real-world example.
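
As a rough illustration of why per-thread nice values struggle here, the
sketch below models one contended CPU where every runnable thread sits in a
single flat scheduling group and receives time in proportion to its weight.
The weight of 1024 for nice 0 matches CFS; the function name and the
single-CPU simplification are assumptions made for illustration and do not
come from this thread.

```python
# Simplified model: CPU time is split among runnable threads in proportion
# to their weights (CFS uses weight 1024 for nice 0). Everything else here
# (one CPU, one flat group, all threads always runnable) is an assumption.
NICE_0_WEIGHT = 1024


def pool_share(active_a, active_b,
               weight_a=NICE_0_WEIGHT, weight_b=NICE_0_WEIGHT):
    """Return the fraction of CPU time that pool B receives in aggregate."""
    total = active_a * weight_a + active_b * weight_b
    if total == 0:
        raise ValueError("no runnable threads")
    return active_b * weight_b / total


if __name__ == "__main__":
    # Pool A always has 10 busy threads; pool B has a varying number.
    for active_b in (1, 10, 50, 100):
        share = pool_share(10, active_b)
        print(f"pool B active={active_b:3d} aggregate share={share:.3f}")
```

In this model the aggregate share of the 100-thread pool moves with the
number of its threads that happen to be runnable, so a fixed per-thread
weight cannot hold the pool's total throughput steady. Grouping the pool
under one weight is what removes that dependency.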

I take that the
CPU intensive helper threads are usually IO workers? Is the scenario
where the VM is set up with a lot of IO devices and different ones may
consume large amount of CPU cycles at any given point?

Yes, generally speaking there are a few major classes of IO (flash,
disk, network) that a guest may invoke. Each of these backends is
separate and chooses its own threading.

Hmmm... if that's the case, would limiting iops on those IO devices
(or classes of them) work? qemu already implements IO limit mechanism
after all.


1) They should proceed at the maximum rate that they can that's still
within their provisioning budget.
2) The cost/IO is both inconsistent and changes over time. Attempting
to micro-optimize every backend for this is infeasible; this is
exactly the type of problem that the scheduler can usefully help with.
3) Even pretending (2) is fixable, dynamically dividing these
right-to-work tokens between different I/O device backends is
extremely complex.

I think I should explain my customer's use cases for qemu + cpuset/cpu (via libvirt).

(1) Isolating hypervisor threads.
As already discussed, hypervisor threads are isolated with cpuset. The purpose
is to avoid _latency_ spikes caused by hypervisor activity, so "nice" cannot be a
solution, as already discussed.

(2) Fixed-rate vcpu service.
Using the cpu controller's quota/period feature, my customer creates vcpu models such as
Low (1GHz), Mid (2GHz), and High (3GHz) for an IaaS system.

To do this, each vcpu must be quota-limited independently, which requires per-thread cpu control.
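
A minimal sketch of how such models could map onto quota/period values,
assuming a 3GHz host core and a 100ms period (both are illustrative
assumptions, as are the function and dictionary names). The resulting number
is what would be written to cpu.cfs_quota_us for the cgroup holding a single
vcpu thread, with cpu.cfs_period_us set to the period.

```python
# Hypothetical helper: derive a CFS quota for one vcpu thread so that it
# behaves like a slower core. Host clock and period are assumed values.
HOST_MHZ = 3000          # assumed physical core clock
PERIOD_US = 100_000      # assumed cpu.cfs_period_us (100ms)

VCPU_MODELS_MHZ = {"Low": 1000, "Mid": 2000, "High": 3000}


def quota_for_model(target_mhz, host_mhz=HOST_MHZ, period_us=PERIOD_US):
    """Return the quota in microseconds per period for the target clock."""
    if not 0 < target_mhz <= host_mhz:
        raise ValueError("target clock must be in (0, host clock]")
    # Integer arithmetic avoids float rounding; the result is rounded down.
    return period_us * target_mhz // host_mhz


if __name__ == "__main__":
    for name, mhz in VCPU_MODELS_MHZ.items():
        print(f"{name}: cpu.cfs_quota_us = {quota_for_model(mhz)}")
```

Because every vcpu is a separate thread of the same qemu process, applying a
different quota to each one only works if threads of one process can be
placed in different cpu cgroups.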

Method (1) in particular is used by several enterprise customers to stabilize their systems.

Sub-process control should be provided in some way.


Anyways, a point here is that threads of the same process competing
isn't a new problem. There are many ways to make those threads play
nice as the application itself often has to be involved anyway,
especially for something like qemu which is heavily involved in
provisioning resources.

It's certainly not a new problem, but it's a real one, and it's
_hard_. You're proposing removing the best known solution.

cgroups can be a nice brute-force add-on which lets sysadmins do wild
things but it's inherently hacky and incomplete for coordinating
threads. For example, what is it gonna do if qemu cloned vcpus and IO
helpers dynamically off of the same parent thread?

We're talking about sub-process usage here. This is the application
coordinating itself, NOT the sysadmin. Processes are becoming larger
and larger, we need many of the same controls within them that we have
between them.

It requires
application's cooperation anyway but at the same time is painful to
actually interact from those applications.

As discussed elsewhere on the thread this is really not a problem if you
define consistent rules with respect to which parts are managed by
who. The argument of potential interference is no different to
messing with an application's on-disk configuration behind its back.
Alternate strawmen which greatly improve this from where we are today
have also been proposed.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at
