Re: [RFC 00/60] Coscheduling for Linux

From: Jan H. SchÃnherr
Date: Wed Sep 26 2018 - 05:36:01 EST

On 09/17/2018 03:37 PM, Peter Zijlstra wrote:
> On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. SchÃnherr wrote:
>> With gang scheduling as defined by Feitelson and Rudolph [6], you'd have to
>> explicitly schedule idle time. With coscheduling as defined by Ousterhout [7],
>> you don't. In this patch set, the scheduling of idle time is "merely" a quirk
>> of the implementation. And even with this implementation, there's nothing
>> stopping you from down-sizing the width of the coscheduled set to take out
>> the idle vCPUs dynamically, cutting down on fragmentation.
> The thing is, if you drop the full width gang scheduling, you instantly
> require the paravirt spinlock / tlb-invalidate stuff again.

Can't say much about tlb-invalidate, but yes to the spinlock stuff: if
there isn't any additional information available, all runnable tasks/vCPUs
have to be coscheduled to avoid lock holder preemption.

With additional information about tasks potentially holding locks or
potentially spinning on a lock, it would be possible to coschedule smaller
subsets -- no idea if that would be any more efficient though.

> Of course, the constraints of L1TF itself requires the explicit
> scheduling of idle time under a bunch of conditions.

That is true for some of the resource contention use cases, too. Though,
they are much more relaxed wrt. their requirements on the simultaneousness
of the context switch.

> I did not read your [7] in much detail (also very bad quality scan that
> :-/; but I don't get how they leap from 'thrashing' to co-scheduling.

In my personal interpretation, that analogy refers to the case where the
waiting time for a lock is shorter than the time for a context switch --
but where the context switch was done anyway, "thrashing" the CPU.

Anyway. I only brought it up, because everyone has a different
understanding of what "coscheduling" or "gang scheduling" actually means.
The memorable quotes are from Ousterhout:

"A task force is coscheduled if all of its runnable processes are exe-
cuting simultaneously on different processors. Each of the processes
in that task force is also said to be coscheduled."

(where a "task force" is a group of closely cooperating tasks), and from
Feitelson and Rudolph:

"[Gang scheduling is defined] as the scheduling of a group of threads
to run on a set of processors at the same time, on a one-to-one

(with the additional assumption of time slices, collective preemption,
and that threads don't relinquish the CPU during their time slice).

That makes gang scheduling much more specific, while coscheduling just
refers to the fact that some things are executed simultaneously.

> Their initial problem, where A generates data that B needs and the 3
> scenarios:
> 1) A has to wait for B
> 2) B has to wait for A
> 3) the data gets buffered
> Seems fairly straight forward and is indeed quite common, needing
> co-scheduling for that, I'm not convinced.
> We have of course added all sorts of adaptive wait loops in the kernel
> to deal with just that issue.
> With co-scheduling you 'ensure' B is running when A is, but that doesn't
> mean you can actually make more progress, you could just be burning a
> lot of CPu cycles (which could've been spend doing other work).

I don't think, that coscheduling should be applied blindly.

Just like the adaptive wait loops you mentioned: in the beginning there
was active waiting; it wasn't that great, so passive waiting was invented;
turns out, the overhead is too high in some cases, so let's spin adaptively
for a moment.

We went from uncoordinated scheduling to system-wide coordinated scheduling
(which turned out to be not very efficient for many cases). And now we are
in the phase to find the right adaptiveness. There is work on enabling
coscheduling only on-demand (when a parallel application profits from it)
or to be more fuzzy about it (giving the scheduler more freedom); there is
work to go away from system-wide coordination to (dynamically) smaller
isles (where I see my own work as well). And "recently" we also have the
resource contention and security use cases leaving their impression on the
topic as well.

> I'm also not convinced co-scheduling makes _any_ sense outside SMT --
> does one of the many papers you cite make a good case for !SMT
> co-scheduling? It just doesn't make sense to co-schedule the LLC domain,
> that's 16+ cores on recent chips.

There's the resource contention stuff, much of which targets the last
level cache or memory controller bandwidth. So, that is making a case for
coscheduling larger parts than SMT. However, I didn't find anything in a
short search that would already cover some of the more recent processors
with 16+ cores.

There's the auto-tuning of parallel algorithms to a certain system
architecture. That would also profit from LLC coscheduling (and slightly
larger time slices) to run multiple of those in parallel. Again, no idea
for recent processors.

There's work to coschedule whole clusters, which goes beyond the scope of a
single system, but also predates recent systems. (Search for, e.g.,
"implicit coscheduling").

So, 16+ cores is unknown territory, AFAIK. But not every recent system has
16+ cores, or will have 16+ cores in the near future.