Re: [RFC] sched: unused cpu in affine workload

From: Peter Zijlstra
Date: Mon Apr 04 2016 - 04:44:23 EST


On Mon, Apr 04, 2016 at 10:23:02AM +0200, Jiri Olsa wrote:
> hi,
> we've noticed the following issue in one of our workloads.
>
> I have a 24-CPU server with the following sched domains:
> domain 0: (pairs)
> domain 1: 0-5,12-17 (group1) 6-11,18-23 (group2)
> domain 2: 0-23 level NUMA
>
> I run a CPU-hogging workload on the following CPUs:
> 4,6,14,18,19,20,23
>
> that is:
> CPUs 4 and 14 from group1
> CPUs 6,18,19,20,23 from group2
>
> the workload process gets its affinity set up via 'taskset -c ${CPUs} workload ...'
> and forks a child for every CPU
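
A minimal sketch of such a workload (my guess at the setup, untested;
not the actual reproducer) inherits the mask taskset installed and
forks one unpinned spinner per allowed CPU, leaving placement entirely
to the load balancer:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        cpu_set_t mask;
        int cpu;

        /* mask as installed by 'taskset -c ...' on the parent */
        if (sched_getaffinity(0, sizeof(mask), &mask)) {
                perror("sched_getaffinity");
                return 1;
        }

        /* one CPU hog per allowed CPU; the children are NOT pinned
         * individually, so spreading them is the load balancer's job */
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
                if (!CPU_ISSET(cpu, &mask))
                        continue;
                if (fork() == 0)
                        for (;;)
                                ; /* burn CPU */
        }

        while (wait(NULL) > 0)
                ;
        return 0;
}
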
>
> very often we notice CPUs 4 and 14 running 3 processes of the workload
> between them, while CPUs 6,18,19,20,23 run just 4 processes, leaving
> one of the CPUs from group2 idle
>
> AFAICS from the code, the reason for this is that the load balancing
> follows the sched domain setup (topology) and does not take affinity
> setups like this into account. The code in find_busiest_group() running
> on the idle CPU from group2 will find group1 as busiest, but group1's
> average load will be smaller than that of the local group, so no tasks
> get pulled.
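
(To put rough numbers on that, assuming the usual avg_load = group load
/ group capacity with each runnable nice-0 task contributing ~1024:

        group1: 3 tasks * 1024 / 12 CPUs worth of capacity ~= 256
        group2: 4 tasks * 1024 / 12 CPUs worth of capacity ~= 341

so from the idle group2 CPU's point of view the local group looks
busier than the candidate busiest group, and find_busiest_group()
declares things balanced even though group1 has a CPU running two
tasks.)
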
>
> It's obvious that the load balancer follows the sched domain topology.
> However, is there some sched feature I'm missing that could help
> with this? Or do we need to follow the sched domain topology when
> we select CPUs for the workload to get even balancing?

Yeah, this is 'hard'. There is some code that tries not to totally blow
up with this, but it's all a bit of a mess. See
kernel/sched/fair.c:sg_imbalanced().
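
IIRC that helper just reads a flag which load_balance() sets when
affinity (LBF_SOME_PINNED) stopped it from moving anything, so the next
round treats the group as imbalanced; roughly (quoting from memory,
double-check against your tree):

static inline int sg_imbalanced(struct sched_group *group)
{
        return group->sgc->imbalance;
}

That only papers over some of the pinned-task cases; it does not make
the group average loads reflect which CPUs the tasks are actually
allowed to run on.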

The easiest solution is to simply not do this and stick with the
topology, like you suggest.

So far I've not come up with a sane/stable solution for this problem.