Re: [RFC 00/60] Coscheduling for Linux

From: Peter Zijlstra
Date: Mon Sep 17 2018 - 09:37:18 EST

On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
> On 09/14/2018 01:12 PM, Peter Zijlstra wrote:

> >> 1. Execute parallel applications that rely on active waiting or synchronous
> >> execution concurrently with other applications.
> >>
> >> The prime example in this class are probably virtual machines. Here,
> >> coscheduling is an alternative to paravirtualized spinlocks, pause loop
> >> exiting, and other techniques with its own set of advantages and
> >> disadvantages over the other approaches.
> >
> > Note that in order to avoid PLE and paravirt spinlocks and paravirt
> > tlb-invalidate you have to gang-schedule the _entire_ VM, not just SMT
> > siblings.
> >
> > Now explain to me how you're going to gang-schedule a VM with a good
> > number of vCPU threads (say spanning a number of nodes) and preserving
> > the rest of CFS without it turning into a massive trainwreck?
>
> You probably don't -- for the same reason, why it is a bad idea to give
> an endless loop realtime priority. It's just a bad idea. As I said in the
> text you quoted: coscheduling comes with its own set of advantages and
> disadvantages. Just because you find one example, where it is a bad idea,
> doesn't make it a bad thing in general.
>
> > Such things (gang scheduling VMs) _are_ possible, but not within the
> > confines of something like CFS, they are also fairly inefficient
> > because, as you do note, you will have to explicitly schedule idle time
> > for idle vCPUs.
>
> With gang scheduling as defined by Feitelson and Rudolph [6], you'd have to
> explicitly schedule idle time. With coscheduling as defined by Ousterhout [7],
> you don't. In this patch set, the scheduling of idle time is "merely" a quirk
> of the implementation. And even with this implementation, there's nothing
> stopping you from down-sizing the width of the coscheduled set to take out
> the idle vCPUs dynamically, cutting down on fragmentation.

The thing is, if you drop the full width gang scheduling, you instantly
require the paravirt spinlock / tlb-invalidate stuff again.

Of course, the constraints of L1TF itself require the explicit
scheduling of idle time under a bunch of conditions.

I did not read your [7] in much detail (also, that is a very bad quality
scan :-/), but I don't get how they leap from 'thrashing' to co-scheduling.
Their initial problem, where A generates data that B needs and the 3
scenarios:

1) A has to wait for B
2) B has to wait for A
3) the data gets buffered

Seems fairly straightforward and is indeed quite common; that it needs
co-scheduling, I'm not convinced.

We have of course added all sorts of adaptive wait loops in the kernel
to deal with just that issue.
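
For illustration only, the general shape of such an adaptive wait (spin
for a bounded time in case the other side is about to finish, then give
up the CPU and block) can be sketched in userspace C. This is not the
kernel's code; the names adaptive_flag_* and the SPIN_LIMIT value are
made up for the example, and a real implementation would also check
whether the other side is actually running before spinning.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed spin budget; an arbitrary illustrative value, not a tuned one. */
#define SPIN_LIMIT 1000

/* Hypothetical "data is ready" flag shared between producer A and consumer B. */
struct adaptive_flag {
	atomic_bool ready;
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

static void adaptive_flag_init(struct adaptive_flag *f)
{
	atomic_init(&f->ready, false);
	pthread_mutex_init(&f->lock, NULL);
	pthread_cond_init(&f->cond, NULL);
}

/* Producer side: publish the flag under the lock so a sleeper cannot miss it. */
static void adaptive_flag_set(struct adaptive_flag *f)
{
	pthread_mutex_lock(&f->lock);
	atomic_store_explicit(&f->ready, true, memory_order_release);
	pthread_cond_broadcast(&f->cond);
	pthread_mutex_unlock(&f->lock);
}

/*
 * Consumer side: spin for a bounded number of iterations, then block.
 * Returns true if the spin phase was enough, false if we had to sleep.
 */
static bool adaptive_flag_wait(struct adaptive_flag *f)
{
	for (int i = 0; i < SPIN_LIMIT; i++) {
		if (atomic_load_explicit(&f->ready, memory_order_acquire))
			return true;
	}

	/* Spinning did not pay off; stop burning cycles and sleep. */
	pthread_mutex_lock(&f->lock);
	while (!atomic_load_explicit(&f->ready, memory_order_acquire))
		pthread_cond_wait(&f->cond, &f->lock);
	pthread_mutex_unlock(&f->lock);
	return false;
}
```

The point of the bounded spin is that the waiter only burns cycles while
there is a reasonable chance the other side finishes soon; past that, the
CPU goes back to the scheduler for other work.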

With co-scheduling you 'ensure' B is running when A is, but that doesn't
mean you can actually make more progress, you could just be burning a
lot of CPU cycles (which could've been spent doing other work).

I'm also not convinced co-scheduling makes _any_ sense outside SMT --
does one of the many papers you cite make a good case for !SMT
co-scheduling? It just doesn't make sense to co-schedule the LLC domain,
that's 16+ cores on recent chips.