Re: [PATCH] sched: Fix numabalancing to work with isolated cpus

From: Peter Zijlstra
Date: Thu Apr 06 2017 - 03:37:15 EST


On Tue, Apr 04, 2017 at 10:57:28PM +0530, Srikar Dronamraju wrote:
> When performing load balancing, numabalancing only looks at
> task->cpus_allowed to see if the task can run on the target cpu. If
> the isolcpus kernel parameter is set, the isolated cpus will not be
> part of the task->cpus_allowed mask.
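>
> For reference, the check numabalancing applies today is essentially
> the one below (a simplified sketch of the test in
> task_numa_find_cpu(), not the exact code):
>
>         /* skip cpus the source task cannot migrate to */
>         if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>                 continue;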
>
> For example: (On a Power 8 box running in smt 1 mode)
>
> isolcpus=56,64,72,80,88
>
> Cpus_allowed_list: 0-55,57-63,65-71,73-79,81-87,89-175
> /proc/20996/task/20996/status:Cpus_allowed_list: 0-55,57-63,65-71,73-79,81-87,89-175
> /proc/20996/task/20997/status:Cpus_allowed_list: 0-55,57-63,65-71,73-79,81-87,89-175
> /proc/20996/task/20998/status:Cpus_allowed_list: 0-55,57-63,65-71,73-79,81-87,89-175
>
> Note: offline cpus are excluded from cpus_allowed_list.
>
> However, a task might call sched_setaffinity() with a mask that
> includes all possible cpus in the system, including the isolated cpus.
>
> For example:
> perf bench numa mem --no-data_rand_walk -p 4 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> calls sched_setaffinity(), which resets the cpus_allowed mask.
>
> Cpus_allowed_list: 0-55,57-63,65-71,73-79,81-87,89-175
> Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
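>
> A minimal userspace illustration of how such a mask comes about is
> below (a hypothetical standalone snippet, not what perf bench does
> verbatim):
>
>         #define _GNU_SOURCE
>         #include <sched.h>
>
>         /* allow the calling thread on every possible cpu, isolated
>          * ones included; error handling omitted for brevity */
>         static void allow_all_cpus(int nr_cpus)
>         {
>                 cpu_set_t mask;
>                 int cpu;
>
>                 CPU_ZERO(&mask);
>                 for (cpu = 0; cpu < nr_cpus; cpu++)
>                         CPU_SET(cpu, &mask);
>
>                 /* nothing filters isolcpus out of this mask */
>                 sched_setaffinity(0, sizeof(mask), &mask);
>         }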
>
> After that call, the isolated cpus are part of the cpus_allowed list,
> and in the above case numabalancing ends up scheduling some of these
> tasks on the isolated cpus.
>
> To avoid this, please check for isolated cpus before choosing a target
> cpu.
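>
> Concretely, the idea is an additional test along these lines (a
> sketch only; cpu_isolated_map is the mask that isolcpus= populates):
>
>         /* never pick an isolated cpu as a numa target */
>         if (cpumask_test_cpu(cpu, cpu_isolated_map))
>                 continue;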

Is there anything stopping the numa balancer from taking tasks off an
isolated CPU?

It's been too long since I've looked at the NUMA bits; but from a quick
reading we almost completely ignore the sched domain stuff.

That means there are likely to be more holes here, and just plugging
them as we find them doesn't appear to be the best approach.

For example, if we use cpusets to partition the scheduler but somehow
leave a task in the root group, it looks like the numa balancer will
happily migrate that task between the partitions.
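
Respecting the partitions would mean consulting the domains when
picking a target, something along these lines (a rough sketch, assuming
the per-cpu sd_numa pointer is clipped to the current partition):

        /* a cpu is a valid numa target only if it sits in the same
         * scheduler partition as the source cpu */
        static bool numa_target_allowed(int src_cpu, int dst_cpu)
        {
                struct sched_domain *sd;
                bool allowed = false;

                rcu_read_lock();
                sd = rcu_dereference(per_cpu(sd_numa, src_cpu));
                if (sd && cpumask_test_cpu(dst_cpu, sched_domain_span(sd)))
                        allowed = true;
                rcu_read_unlock();

                return allowed;
        }

That would also cover the isolcpus case, since isolated cpus are not
attached to any sched domain.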

So please try and fix the bigger problem, then I think this one will go
away as well.