Re: [RFC] sched: unused cpu in affine workload

From: Ingo Molnar
Date: Mon Apr 04 2016 - 04:59:54 EST



* Jiri Olsa <jolsa@xxxxxxxxxx> wrote:

> hi,
> we've noticed following issue in one of our workloads.
>
> I have 24 CPUs server with following sched domains:
> domain 0: (pairs)
> domain 1: 0-5,12-17 (group1) 6-11,18-23 (group2)
> domain 2: 0-23 level NUMA
>
> I run CPU hogging workload on following CPUs:
> 4,6,14,18,19,20,23
>
> that is:
> 4,14 CPUs from group1
> 6,18,19,20,23 CPUs from group2
>
> the workload process gets affinity setup via 'taskset -c ${CPUs workload ...'
> and forks child for every CPU
>
> very often we notice CPUs 4 and 14 running 3 processes of the workload
> while CPUs 6,18,19,20,23 run just 4 processes, leaving one of the
> CPUs in group2 idle
>
> AFAICS from the code, the reason for this is that the load balancing
> follows the sched domain setup (topology) and does not take affinity
> setups like this into account. The code in find_busiest_group, running
> on the idle CPU from group2, will find group1 as the busiest group, but
> group1's average load will be smaller than the local group's, so there's
> no task pulling.
>
> It's obvious that the load balancer follows the sched domain topology.
> However, is there some sched feature I'm missing that could help
> with this? Or do we need to follow the sched domain topology when
> selecting CPUs for the workload to get even balancing?

Yeah, so the principle with user-pinning of tasks to CPUs was always:

- pinning a task to a single CPU should obviously work fine; it's the primary
use case for isolation.

- pinning a task to an arbitrary subset of CPUs is a 'hard' problem
mathematically that the scheduler never truly wanted to solve in a frontal
fashion.

... but that principle was set into place well before we did the NUMA scheduling
work, which in itself is a highly non-trivial load optimization problem to begin
with, so we might want to reconsider.

So there's two directions I can suggest:

- if you can come up with workable small-scale solutions to scratch an itch
that comes up in practice then that's obviously good, as long as it does not
regress anything else.

- if you want to come up with a 'complete' solution then please don't put it into
hot paths such as wakeup or context switching, or any of the hardirq methods,
but try to integrate it with the NUMA scheduling slow path.

The NUMA balancing slow path is softirq driven and runs at a reasonably low
frequency, so it does not cause many performance problems.

The two problems (NUMA affinity and user affinity) are also loosely related on a
conceptual level: the NUMA affinity optimization problem can be considered as a
workload-determined, arbitrary 'NUMA mask' being optimized from first principles.

There's one ABI detail: this is true only as long as SMP affinity masks follow
node boundaries - the current NUMA balancing code is very much node granular, so
the two can only be merged if the ->cpus_allowed mask follows node boundaries as
well.
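[Editorial note: the node-boundary condition on ->cpus_allowed can be expressed as a simple set check. The sketch below is hypothetical illustration, not kernel code; the CPU->node map mirrors the domain-1 groups from the report and is an assumption.]

```python
# Hedged sketch: is an affinity mask a union of whole NUMA nodes,
# i.e. could a node-granular balancer handle it as-is?

NODE0 = set(range(0, 6)) | set(range(12, 18))   # CPUs 0-5,12-17
NODE1 = set(range(6, 12)) | set(range(18, 24))  # CPUs 6-11,18-23
CPU_TO_NODE = {c: (0 if c in NODE0 else 1) for c in range(24)}
NODE_CPUS = {0: NODE0, 1: NODE1}

def follows_node_boundaries(cpus_allowed):
    """True if cpus_allowed contains every CPU of each node it touches."""
    cpus_allowed = set(cpus_allowed)
    touched = {CPU_TO_NODE[c] for c in cpus_allowed}
    return all(NODE_CPUS[n] <= cpus_allowed for n in touched)

print(follows_node_boundaries(NODE0))                       # True
print(follows_node_boundaries([4, 6, 14, 18, 19, 20, 23]))  # False
```

The workload mask from the report fails the check, which is exactly why the current node-granular NUMA balancing code cannot absorb this case without extension.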

A third approach would be to extend the NUMA balancing code to be CPU granular
(without changing any task placement behavior of the current NUMA balancing code,
of course), with node granular being a special case. This would fit the cgroups
(and virtualization) use cases, but that would be a major change.

Thanks,

Ingo