Re: [RFC] sched: unused cpu in affine workload
From: Ingo Molnar
Date: Mon Apr 04 2016 - 05:20:01 EST
* Ingo Molnar <mingo@xxxxxxxxxx> wrote:
> - if you want to come up with a 'complete' solution then please don't put it into
> hot paths such as wakeup or context switching, or any of the hardirq methods,
> but try to integrate it with the NUMA scheduling slow path.
>
> The NUMA balancing slow path: that is softirq driven and runs at a reasonably
> low frequency, so it does not cause many performance problems.
>
> The two problems (NUMA affinity and user affinity) are also loosely related on a
> conceptual level: the NUMA affinity optimization problem can be considered as a
> workload-determined, arbitrary 'NUMA mask' being optimized from first
> principles.
>
> There's one ABI detail: this is true only as long as SMP affinity masks follow
> node boundaries - the current NUMA balancing code is very much node granular, so
> the two can only be merged if the ->cpus_allowed mask follows node boundaries as
> well.
>
> A third approach would be to extend the NUMA balancing code to be CPU granular
> (without changing any task placement behavior of the current NUMA balancing code
> of course), with node granular being a special case. This would fit the cgroups
> (and virtualization) usecases, but that would be a major change.
So my thinking here is: if the NUMA balancing code (which is node granular at the
moment and uses node masks, etc.) is extended to be CPU granular (which is a big
task in itself), then the two problems can be 'unified':
- the NUMA balancing code inputs arbitrary CPU (node) affinity masks from the
MM code into the scheduler.
- the scheduler syscall ABI (and other configuration sources) inputs arbitrary
CPU affinity masks into the scheduler.
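As a rough illustration, a unified input interface could look something like
this - all of the structure and helper names here are made up for the sketch,
only the cpumask primitives are real:

/*
 * Hypothetical sketch: two input sources feeding one per-task affinity
 * state - a soft preference (NUMA balancing) and a hard mask (syscall ABI):
 */
struct task_affinity {
	cpumask_t	cpus_allowed;	/* hard, from sched_setaffinity() */
	cpumask_t	cpus_preferred;	/* soft, from the NUMA balancer */
};

/* MM/NUMA side: node granular masks become a special case of CPU masks: */
static void affinity_set_preferred(struct task_struct *p,
				   const struct cpumask *mask)
{
	cpumask_copy(&p->affinity.cpus_preferred, mask);
}

/* syscall side: the hard constraint - must never be violated: */
static int affinity_set_allowed(struct task_struct *p,
				const struct cpumask *mask)
{
	if (cpumask_empty(mask))
		return -EINVAL;
	cpumask_copy(&p->affinity.cpus_allowed, mask);
	return 0;
}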
It's a similar problem in both cases, with two (minor looking) complications:
- the NUMA code right now is 'statistical', while ->cpus_allowed is a hard
constraint that must never be violated. So there always has to be a final
layer that enforces the hard constraint - a layer which does not exist in the
NUMA balancing case (see the sketch after this list). This should be
relatively easy, I think, as we already do it in the regular balancer.
- the balancing slowpath would have to be activated on non-NUMA systems as well,
so that it can handle ->cpus_allowed balancing.
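To make the first complication concrete, the final hard-constraint layer could
be as simple as the following sketch - the helper name is made up,
cpumask_test_cpu()/cpumask_any() are the real primitives:

/*
 * Hypothetical final filter: whatever CPU the (statistical) slow path
 * suggests, the hard ->cpus_allowed mask always wins:
 */
static int affinity_constrain_cpu(struct task_struct *p, int preferred_cpu)
{
	if (cpumask_test_cpu(preferred_cpu, &p->cpus_allowed))
		return preferred_cpu;

	/* the preference is not allowed - fall back to any allowed CPU: */
	return cpumask_any(&p->cpus_allowed);
}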
... once all that is solved, I can see several advantages from unifying the NUMA
balancing and SMP affinity balancing code:
- the NUMA balancer would improve: cpus_allowed isolation is used much more
frequently in practice, so fixes driven by those workloads would benefit the
NUMA balancing case as well.
- testing the NUMA balancer would become easier: we'd simply set cpus_allowed
and watch how it balances (see the example after this list). No need to coax
workloads into actual MM NUMA usage patterns to set up interesting scenarios.
- our existing half-hearted ways to deal with cpus_allowed balancing could be
outsourced to the NUMA slow path, which would simplify the SMP balancing fast
path.
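To illustrate the testing angle: a test could set the hard mask from userspace
via the standard sched_setaffinity() API (or just taskset) and then watch the
placement decisions:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	/* pin the current task to CPUs 0-1: */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	CPU_SET(1, &set);

	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	/* ... run the workload and observe how it gets balanced ... */
	return 0;
}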
But it's a major piece of work, and I might be missing implementation details.
It would be the biggest new scheduler feature since NUMA balancing for sure.
Thanks,
Ingo