On Wed, Jun 07, 2017 at 01:18:58PM -0600, Jeffrey Hugo wrote:
> The group_imbalance path in calculate_imbalance() made sense when it was
> added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
> balance across sched domain") because busiest->load_per_task factored into
> the amount of imbalance that was calculated. That is not the case today.
It would be nice to have some more information on which patch(es)
changed that.
> The group_imbalance path can only affect the outcome of
> calculate_imbalance() when the average load of the domain is less than the
> original busiest->load_per_task. In this case, busiest->load_per_task is
> overwritten with the scheduling domain load average. Thus
> busiest->load_per_task no longer represents actual load that can be moved.
> At the final comparison between env->imbalance and busiest->load_per_task,
> imbalance may be larger than the new busiest->load_per_task causing the
> check to fail under the assumption that there is a task that could be
> migrated to satisfy the imbalance. However, env->imbalance may still be
> smaller than the original busiest->load_per_task, thus it is unlikely that
> there is a task that can be migrated to satisfy the imbalance.
> calculate_imbalance() would not choose to run fix_small_imbalance() when we
> expect it should. In the worst case, this can result in idle cpus.
>
> Since the group imbalance path in calculate_imbalance() is at best a NOP
> but otherwise harmful, remove it.
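As a rough illustration of the case described above, here is a toy
userspace model of just the clamp and the final comparison. It is not the
kernel code: the helper names, the with_clamp switch and the numbers are
all made up for the example, and everything else calculate_imbalance() does
is left out.

```c
#include <stdbool.h>

/* Hypothetical helper; stands in for the kernel's min(). */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Toy model of the tail of calculate_imbalance().
 *
 * Returns true when the final check would hand over to
 * fix_small_imbalance(), i.e. when imbalance < load_per_task.
 *
 * with_clamp is an illustration-only switch: true models the
 * group_imbalance path (load_per_task overwritten with the domain
 * average when that is smaller), false models the path being removed.
 */
static bool takes_small_imbalance_path(unsigned long load_per_task,
				       bool group_imbalanced,
				       unsigned long domain_avg_load,
				       unsigned long imbalance,
				       bool with_clamp)
{
	if (with_clamp && group_imbalanced)
		load_per_task = min_ul(load_per_task, domain_avg_load);

	return imbalance < load_per_task;
}
```

With load_per_task = 1024, a domain average of 300 and an imbalance of 500,
the clamped variant compares 500 against 300 and skips
fix_small_imbalance(), even though no task as small as 500 is likely to
exist. Without the clamp the comparison is 500 < 1024 and
fix_small_imbalance() runs. When the domain average is not below
load_per_task, both variants give the same answer, which is the "at best a
NOP" case.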
load_per_task is horrible and should die. Ever since we did cgroup
support the number is complete crap, but even before that the concept
was dubious.
Most of the logic that uses the number stems from the pre-smp-nice era.
This also of course means that fix_small_imbalance() is probably a load
of crap. Digging through all that has been on the todo list for a long
while but somehow not something I've ever gotten to :/