Re: [RFC 00/60] Coscheduling for Linux

From: Peter Zijlstra
Date: Fri Sep 14 2018 - 07:13:02 EST

On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
> This patch series extends CFS with support for coscheduling. The
> implementation is versatile enough to cover many different coscheduling
> use-cases, while at the same time being non-intrusive, so that behavior of
> legacy workloads does not change.

I don't call this non-intrusive.

> Peter Zijlstra once called coscheduling a "scalability nightmare waiting to
> happen". Well, with this patch series, coscheduling certainly happened.

I'll beg to differ; this isn't anywhere near something to consider
merging. Also, 'happened' suggests a certain stage of completeness; this,
again, doesn't qualify.

> However, I disagree on the scalability nightmare. :)

There are known scalability problems with the existing cgroup muck; you
just made things a ton worse. The existing cgroup overhead is
significant, you also made that many times worse.

The cgroup stuff needs cleanups and optimization, not this.

> B) Why would I want this?

> In the L1TF context, it prevents other applications from loading
> additional data into the L1 cache, while one application tries to leak
> data.

That is the whole and only reason you did this; and it doesn't even
begin to cover the requirements for it.

Not to mention I detest cgroups; for their inherent complexity and the
performance costs associated with them. _If_ we're going to do
something for L1TF then I feel it should not depend on cgroups.

It is, after all, perfectly possible to run a kvm thingy without cgroups.

> 1. Execute parallel applications that rely on active waiting or synchronous
> execution concurrently with other applications.
> The prime example in this class are probably virtual machines. Here,
> coscheduling is an alternative to paravirtualized spinlocks, pause loop
> exiting, and other techniques with its own set of advantages and
> disadvantages over the other approaches.

Note that in order to avoid PLE and paravirt spinlocks and paravirt
tlb-invalidate you have to gang-schedule the _entire_ VM, not just SMT
siblings.

Now explain to me how you're going to gang-schedule a VM with a good
number of vCPU threads (say spanning a number of nodes) while preserving
the rest of CFS, without it turning into a massive trainwreck?

Such things (gang scheduling VMs) _are_ possible, but not within the
confines of something like CFS. They are also fairly inefficient
because, as you do note, you will have to explicitly schedule idle time
for idle vCPUs.

Things like the Tableau scheduler are what come to mind; but I'm not
sure how to integrate that with a general purpose scheduling scheme. You
pretty much have to dedicate a set of CPUs to just scheduling VMs with
such a scheduler.

And that would call for cpuset-v2 integration along with a new
scheduling class.

And then people will complain again that partitioning a system isn't
dynamic enough and we need magic :/

(and this too would be tricky to virtualize itself)

> C) How does it work?
> --------------------
> This patch series introduces hierarchical runqueues, that represent larger
> and larger fractions of the system. By default, there is one runqueue per
> scheduling domain. These additional levels of runqueues are activated by
> the "cosched_max_level=" kernel command line argument. The bottom level is
> 0.

You gloss over a ton of details here; many of which are non trivial and
marked broken in your patches. Unless you have solid suggestions on how
to deal with all of them, this is a complete non-starter.

The per-cpu IRQ/steal time accounting for example. The task timeline
isn't the same on every CPU because of those.
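
To make the per-CPU timeline point concrete, here is a minimal userspace sketch. It assumes a simplified model in which the time credited to tasks on a CPU is the wall-clock delta minus whatever went to IRQ handling and hypervisor steal on that CPU, which is roughly what the per-runqueue task clock does. The names toy_rq and toy_update_clock_task are hypothetical and exist only for this illustration; they are not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of a per-CPU runqueue clock; not kernel code. */
struct toy_rq {
	uint64_t clock;       /* wall-clock time seen by this CPU, in ns */
	uint64_t clock_task;  /* time actually credited to tasks, in ns */
};

/*
 * Advance both clocks by delta ns. Of that delta, irq ns were spent
 * handling interrupts and steal ns were taken by the hypervisor; only
 * the remainder is credited to the task timeline.
 */
static void toy_update_clock_task(struct toy_rq *rq, uint64_t delta,
				  uint64_t irq, uint64_t steal)
{
	uint64_t lost = irq + steal;

	assert(rq != NULL);

	/* Never charge more lost time than actually elapsed. */
	if (lost > delta)
		lost = delta;

	rq->clock += delta;
	rq->clock_task += delta - lost;
}
```

Feeding two toy_rq instances the same 10 ms wall-clock delta, one with no IRQ or steal time and one with 1.5 ms of IRQ time plus 2 ms of steal, leaves their clock fields equal but their clock_task fields 3.5 ms apart. Any entity shared between those two CPUs would have to reconcile the two timelines.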

You now basically require steal time and IRQ load to match between CPUs.
That places very strict requirements and effectively breaks virt
invariance. That is, the scheduler now behaves significantly differently
inside a VM than it does outside of it. Without the guest being gang
scheduled itself and having physical pinning to reflect the same
topology, the coschedule=1 thing should not be exposed in a guest. And
that is a major failing IMO.

Also; I think you're sharing a cfs_rq between CPUs:

+ init_cfs_rq(&sd->shared->rq.cfs);

That is broken; the virtual runtime stuff needs nontrivial modifications
for multiple CPUs. And if you do that, I've no idea how you're dealing
with SMP affinities.
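
For readers less familiar with the virtual runtime mentioned above, here is a minimal sketch. It assumes the usual CFS rule that an entity's vruntime advances by its real runtime scaled by the nice-0 weight over its own weight, and that a queue picks the entity with the smallest vruntime. The names toy_entity, toy_charge and toy_pick_next are hypothetical. The sketch models a single queue served by one CPU at a time, and deliberately does not attempt the multi-CPU case the objection is about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NICE_0_WEIGHT 1024u  /* load weight of a nice-0 task in CFS */

/* Hypothetical toy entity; not the kernel's struct sched_entity. */
struct toy_entity {
	uint32_t weight;    /* load weight; higher means a larger CPU share */
	uint64_t vruntime;  /* weighted runtime in ns, relative to one queue */
};

/* Charge delta_exec ns of real runtime, scaled inversely by weight. */
static void toy_charge(struct toy_entity *se, uint64_t delta_exec)
{
	assert(se != NULL && se->weight > 0);
	se->vruntime += delta_exec * TOY_NICE_0_WEIGHT / se->weight;
}

/* Pick the entity with the smallest vruntime, as a single queue would. */
static struct toy_entity *toy_pick_next(struct toy_entity *a,
					struct toy_entity *b)
{
	return (a->vruntime <= b->vruntime) ? a : b;
}
```

Charging 4 ms to an entity of weight 1024 and 4 ms to one of weight 2048 advances their vruntime by 4 ms and 2 ms respectively, so the heavier entity is picked next. The ordering only makes sense while all entities are charged against one consistent timeline, which is the assumption a cfs_rq shared between CPUs puts under strain.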

> You currently have to explicitly set affinities of tasks within coscheduled
> task groups, as load balancing is not implemented for them at this point.

You don't even begin to outline how you preserve smp-nice fairness.

> D) What can I *not* do with this?
> ---------------------------------
> Besides the missing load-balancing within coscheduled task-groups, this
> implementation has the following properties, which might be considered
> short-comings.
> This particular implementation focuses on SCHED_OTHER tasks managed by CFS
> and allows coscheduling them. Interrupts as well as tasks in higher
> scheduling classes are currently out-of-scope: they are assumed to be
> negligible interruptions as far as coscheduling is concerned and they do
> *not* cause a preemption of a whole group. This implementation could be
> extended to cover higher scheduling classes. Interrupts, however, are an
> orthogonal issue.
> The collective context switch from one coscheduled set of tasks to another
> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
> that all tasks of the previous set have stopped executing before any task
> of the next set starts executing, an additional hand-shake/barrier needs to
> be added.

IOW it's completely friggin useless for L1TF.

> E) What's the overhead?
> -----------------------
> Each (active) hierarchy level has roughly the same effect as one additional
> level of nested cgroups. In addition -- at this stage -- there may be some
> additional lock contention if you coschedule larger fractions of the system
> with a dynamic task set.

Have you actually read your own code?

What about that atrocious locking you sprinkle all over the place?
'some additional lock contention' doesn't even begin to describe that
horror show.

Hint: we're not going to increase the lockdep subclasses, and most
certainly not for scheduler locking.

All in all, I'm not inclined to consider this approach; it complicates
an already overly complicated thing (cpu-cgroups) and has a ton of
unresolved issues, while at the same time it doesn't (and cannot) meet
the goal it was made for.