Re: [PATCH] sched: Fix numabalancing to work with isolated cpus

From: Michal Hocko
Date: Wed Apr 05 2017 - 12:46:07 EST


On Wed 05-04-17 20:52:15, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@xxxxxxxxxx> [2017-04-05 14:57:43]:
>
> > On Tue 04-04-17 22:57:28, Srikar Dronamraju wrote:
> > [...]
> > > For example:
> > > perf bench numa mem --no-data_rand_walk -p 4 -t $THREADS -G 0 -P 3072 -T 0 -l 50 -c -s 1000
> > > would call sched_setaffinity that resets the cpus_allowed mask.
> > >
> > > Cpus_allowed_list: 0-55,57-63,65-71,73-79,81-87,89-175
> > > Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > > Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > > Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > > Cpus_allowed_list: 0,8,16,24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144,152,160,168
> > >
> > > The isolated cpus are part of the cpus allowed list. In the above case,
> > > numabalancing ends up scheduling some of these tasks on isolated cpus.
> >
> > Why is this bad? If the task is allowed to run on isolated CPUs then why
>
> 1. kernel-parameters.txt states: isolcpus as "Isolate CPUs from the
> general scheduler." So the expectation that numabalancing can schedule
> tasks on it is wrong.

Right, but if the task is allowed to run on isolated cpus then the numa
balancing for this task should be allowed to run on those cpus, no?
Say your application would be bound _only_ to isolated cpus. Should that
imply no numa balancing at all?

> 2. If numabalancing was disabled, the task would never run on the
> isolated CPUs.

I am confused. I thought you said "However a task might call
sched_setaffinity() that includes all possible cpus in the system
including the isolated cpus." So the task is allowed to run there.
Or am I missing something?
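
To make the question concrete, here is a minimal sketch (Python, purely
illustrative, not part of the patch) that checks whether an affinity
mask in the Cpus_allowed_list format overlaps a set of isolated cpus.
The helper names are hypothetical. The isolated set 56,64,72,80,88 is
an assumption inferred from the cpus missing in the first
Cpus_allowed_list line quoted above; it is not stated in the report.

```python
def parse_cpu_list(text):
    """Parse a kernel-style cpu list such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def isolated_in_affinity(allowed, isolated):
    """Return the sorted cpus that are both allowed for the task and isolated."""
    return sorted(parse_cpu_list(allowed) & parse_cpu_list(isolated))


# Masks copied from the report quoted above.
PARENT_MASK = "0-55,57-63,65-71,73-79,81-87,89-175"
THREAD_MASK = ("0,8,16,24,32,40,48,56,64,72,80,88,96,104,"
               "112,120,128,136,144,152,160,168")

# Assumption: the isolcpus= set, inferred from the holes in PARENT_MASK.
ASSUMED_ISOLATED = "56,64,72,80,88"

if __name__ == "__main__":
    print("parent overlap:", isolated_in_affinity(PARENT_MASK, ASSUMED_ISOLATED))
    print("thread overlap:", isolated_in_affinity(THREAD_MASK, ASSUMED_ISOLATED))
```

Under that assumption the first mask has no overlap with the isolated
set, while the masks set via sched_setaffinity() contain all five
isolated cpus, i.e. the task has explicitly been allowed to run there.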

> 3. With the faulty behaviour, it was observed that tasks scheduled on
> the isolated cpus might end up taking more time, because they never get
> a chance to move back to a node which has local memory.

I am not sure I understand.

> 4. The isolated cpus may be idle at that point, but actual work may be
> scheduled on isolcpus later (when numabalancing had already scheduled
> work on to it.) Since scheduler doesn't do any balancing on isolcpus even
> if they are overloaded and the system is completely free, the isolcpus
> stay overloaded.

Please note that I do not claim the patch is wrong. I am still not sure
myself, but the changelog is missing the most important information: "why
the change is the right thing".
--
Michal Hocko
SUSE Labs