Re: [RFC 00/60] Coscheduling for Linux

From: Jan H. Schönherr
Date: Fri Oct 19 2018 - 07:40:17 EST

On 17/10/2018 04.09, Frederic Weisbecker wrote:
> On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
>> C) How does it work?
>> --------------------
>> For each task-group, the user can select at which level it should be
>> scheduled. If you set "cpu.scheduled" to "1", coscheduling will typically
>> happen at core-level on systems with SMT. That is, if one SMT sibling
>> executes a task from this task group, the other sibling will do so, too. If
>> no task is available, the SMT sibling will be idle. With "cpu.scheduled"
>> set to "2" this is extended to the next level, which is typically a whole
>> socket on many systems. And so on. If you feel, that this does not provide
>> enough flexibility, you can specify "cosched_split_domains" on the kernel
>> command line to create more fine-grained scheduling domains for your
>> system.
> Have you considered using cpuset to specify the set of CPUs inside which
> you want to coschedule task groups in? Perhaps that would be more flexible
> and intuitive to control than this cpu.scheduled value.

Yes, I did consider cpusets. There are two dimensions to it, though:
a) at what fraction of the system tasks shall be coscheduled, and
b) where these tasks shall execute within the system.

cpusets would be the obvious answer to the "where". However, in their
current form they are too inflexible and carry too much overhead. Suppose
you want to coschedule two tasks on the SMT siblings of a core. You could
restrict the tasks to a specific core with a cpuset. But then the group is
bound to that core, and the load balancer cannot move the two tasks
together to a different core.
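
To illustrate the static binding described above, here is a minimal sketch
in Python. It is an assumption-laden illustration, not part of the patch
series: it uses plain sched_setaffinity() rather than a cpuset, reads the
SMT siblings from the sysfs topology file thread_siblings_list, and the
helper names (parse_cpu_list, smt_siblings, pin_to_core_of) are made up
for this example.

```python
import os


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def smt_siblings(cpu):
    """Return the SMT siblings of `cpu` (including `cpu`) as reported by sysfs."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpu_list(f.read())


def pin_to_core_of(cpu, pids):
    """Restrict every pid to the core that contains `cpu` (hypothetical helper).

    The mask names concrete CPUs, so the tasks stay on this one core until
    somebody rewrites the mask; the load balancer cannot pick another core.
    """
    mask = smt_siblings(cpu)
    for pid in pids:
        os.sched_setaffinity(pid, mask)
    return mask
```

Calling pin_to_core_of(0, [pid_a, pid_b]) would confine both tasks to the
core of CPU 0. Nothing in the mask expresses "any one core", which is the
inflexibility in question.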

Now, it would be possible to "invent" relocatable cpusets to address that
issue ("I want affinity restricted to a core, I don't care which"), but
the way cpuset affinity is currently enforced doesn't scale well enough to
be used from within the balancer. (The upcoming load balancing portion of
the coscheduler currently uses a file similar to cpu.scheduled to restrict
affinity to a load-balancer-controlled subset of the system.)

Using cpusets as the means to describe which parts of the system are to be
coscheduled *may* be possible. But if so, it's a long way off. The current
implementation uses scheduling domains for this, because (a) most
coscheduling use cases require an alignment to the topology, and (b) it
integrates really nicely with the load balancer.
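
As a rough model of how a cpu.scheduled level maps onto the topology, the
following Python sketch may help. It assumes a hypothetical 8-CPU system
(2 sockets, 2 cores per socket, 2 SMT threads per core) with contiguous
CPU numbering, which real machines often do not have. The function name,
the domain_sizes tuple, and the top "whole system" level are illustrative
only and do not come from the patches.

```python
def coscheduled_cpus(cpu, level, domain_sizes=(1, 2, 4, 8)):
    """Return the CPUs that would be coscheduled together with `cpu`.

    domain_sizes[level] is the number of CPUs in the scheduling domain at
    that level of the assumed toy topology:
      level 0: single CPU (no coscheduling)
      level 1: core (SMT siblings)
      level 2: socket
      level 3: whole system
    Assumes CPUs of one domain are numbered contiguously.
    """
    size = domain_sizes[level]
    base = (cpu // size) * size
    return set(range(base, base + size))


# Example: a group with level 1 running on CPU 5 also occupies CPU 4.
print(sorted(coscheduled_cpus(5, 1)))
print(sorted(coscheduled_cpus(5, 2)))
```

The point of the model is that the level selects a fraction of the system
(dimension a) without naming any particular CPUs (dimension b), so the
group remains free to move between domains of the same size.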

AFAIK, there is already some interaction between cpusets and scheduling
domains. But it is supposed to be rather static, and as soon as you have
overlapping cpusets, you end up with the default scheduling domains.
If we were able to make the scheduling domains more dynamic than they are
today, we might be able to couple that to cpusets (or some similar
interface to *define* scheduling domains).