Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()
From: Jeffrey Hugo
Date: Fri Jul 28 2017 - 14:09:53 EST
On 7/28/2017 7:34 AM, Dietmar Eggemann wrote:
On 28/07/17 13:59, Peter Zijlstra wrote:
On Fri, Jul 28, 2017 at 01:16:24PM +0100, Dietmar Eggemann wrote:
IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
level sched groups.
[...]
If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
Then instantly they balance and I get 2x100% back.
Yeah, could reproduce on IVB-EP (2x10x2).
OK, I have one of those. What should I do, because I didn't actually see
anything odd.
Me neither. Sorry, I was unclear. I meant I could reproduce your
example, that one of the 50% task moves to the idle cpu on this machine.
[...]
Dietmar's assessment is essentially correct. We are seeing a
circumstance where fix_small_imbalance() could fix a situation with an
idle core and an over-worked core, but it is being skipped. Here is a
detailed step through of the specific case we're seeing.
Consider the following system.
DIE [ (all cores) ]
MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
Not NUMA, not SMT.
Imagine we have a workload which spawns 5 CPU intensive workers, which
we want taskset to 5 cores such that each worker has its own core. We
want to see:
DIE [ (all cores) ]
MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
* * * * *
T T T T T
Where * indicates the taskset, and T indicates each thread/process.
In many cases, workloads are not ideally assigned at fork(), because
load statistics haven't had a chance to be generated by the new work.
This results in the following:
DIE [ (all cores) ]
MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
* * * * *
T T T T
T
Core 4 is over-assigned (more workers than it can efficiently handle)
and core 2 is under-assigned (no work to do). Not great, but we expect
the system to load_balance() quickly, and move one T from core 4 to core 2.
We do not see this occur, which means the scheduler is not doing its job
by leaving a core idle when there's work to do.
What happens -
Core 5 is idle, so it triggers load_balance(). At the MC level, it
identifies core 4 as busiest, and attempts to pull load. Core 5 fails
because it is not part of the taskset, and load cannot be moved to
another available core (6 or 7), so the group_imbalance flag is set.
Core 2 is idle, so it triggers load_balance(). It won't pull any load
from its siblings in MC (0, 1, or 3) because doing so does not solve an
imbalance, it just moves the imbalance around. load_balance() at MC
fails, so it moves up to DIE. Since the group containing core 4 is
group_imbalance, it is selected as busiest.
We go to calculate_imbalance(). Since each T is CPU intensive, their
load is high, lets say 200, thus busiest->load_per_task is 200.
However, we have a lot of idle or lightly loaded cores at the DIE level,
so the sds average (sds->avg_load) gets pushed down significantly; lets
say it is 20. Since busiest->group_imbalance is true, we take the min
of busiest->load_per_task and sds average (200 and 20), and make that
the new busiest->load_per_task.
We calculate env->imbalance, which we'll say is 48.
Now we hit the end check -
/*
* if *imbalance is less than the average load per runnable task
* there is no guarantee that any tasks will be moved so we'll have
* a think about bumping its value to force at least one task to be
* moved
*/
if (env->imbalance < busiest->load_per_task)
return fix_small_imbalance(env, sds);
We try to see if the calculated imbalance can be solved by moving tasks
around by comparing the calculated imbalance (env->imbalance) to the
calculated average load per task in the busiest group
(busiest->load_per_task). Since busiest->load_per_task is an average,
either there are tasks with load less than that, or all tasks are at
that value, thus if env->imbalance is more than busiest->load_per_task,
we have candidate tasks we can evaluate to address the imbalance.
env->imbalance is 48, and the modified busiest->load_per_task is 20, so
the check fails, and we exit out of calculate_imbalance() with an
imbalance to solve of 48. We jump all the way up to load_balance(), and
then into detach_tasks(). We find two tasks that are migration
candidates because core 2 is in the same taskset as core 4, so we
evaluate if they can help solve the imbalance, or would just move the
imbalance around -
if ((load / 2) > env->imbalance)
goto next;
Load of each task is 200, so 200 / 2 = 100, 100 > 48. We can't move
anything as doing so would not help solve the imbalance, just move it
around, and thus why take the hit of actually migrating the process.
We might not resolve the imbalance, but we get better CPU usage if we
move a task from core 4 to core 2, which is what we expect. Based on a
comment within the function fix_small_imbalance(), this appears to be
the correct path for this circumstance.
In fix_small_imbalance() -
/*
* OK, we don't have enough imbalance to justify moving tasks,
* however we may be able to increase total CPU capacity used by
* moving them.
*/
As mentioned above, we aren't hitting this function path for the
following reason:
if (env->imbalance < busiest->load_per_task)
return fix_small_imbalance(env, sds);
env->imbalance is 48, and busiest->load_per_task is 20, but
busiest->load_per_task was 200. 48 < 200 is true, so we would hit
fix_small_imbalance. The group_imbalance path modifies
busiest->load_per_task, so what happens when we remove that path (i.e.
this change)? We see the behavior we expect.
This -
DIE [ (all cores) ]
MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
* * * * *
T T T T
T
Goes to this -
DIE [ (all cores) ]
MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
* * * * *
T T T T T
So, we've identified an issue, and root caused it to a small code
snippet. Great for us, but the scheduler is generic code, it needs to
work for all. Can this code really be removed?
Here is our current thinking on all uses of this code path. Let us know
if we missed anything.
SMT avoids this issue because of SD_PREFER_SIBLING - group overloaded is
like group imbalanced, but does not set busiest->group_imbalance so
busiest->load_per_task never gets modified.
In cases where the path is actually used:
1- busiest->group_imbalance is not set. The code path is equal to a
NOP, therefore it has no positive or negative effect.
2- busiest->group_imbalance is set, but sds->avg_load is greater than
busiest->load_per_task. busiest->load_per_task gets assigned the value
of busiest->load_per_task, so nothing changes. The code path is equal
to a NOP, therefore it has no positive or negative effect.
3- busiest->group_imbalance is set, but sds->avg_load is lesser than
busiest->load_per_task. busiest->load_per_task gets assigned the value
of sds->avg_load. If the calculated imbalance (env->imbalance) is more
than the original busiest->load_per_task, it will be more than
sds->avg_load, so we do not hit fix_small_imbalance() as expected. The
code path is equal to a NOP, therefore it has no positive or negative
effect.
4- busiest->group_imbalance is set, but sds->avg_load is lesser than
busiest->load_per_task. busiest->load_per_task gets assigned the value
of sds->avg_load. If the calculated imbalance (env->imbalance) is less
than the original busiest->load_per_task, but also less than
sds->avg_load, we hit fix_small_imbalance(), but would have hit that
anyways. The code path is equal to a NOP, therefore it has no positive
or negative effect.
5- busiest->group_imbalance is set, but sds->avg_load is lesser than
busiest->load_per_task. busiest->load_per_task gets assigned the value
of sds->avg_load. If the calculated imbalance (env->imbalance) is less
than the original busiest->load_per_task, but more than sds->avg_load,
we do not hit fix_small_imbalance(). If we happen to have tasks we can
move around based on their load values compared to the current
env->imbalance, we can make migrations. The code path is equal to a
NOP, therefore it has no positive or negative effect.
6- busiest->group_imbalance is set, but sds->avg_load is lesser than
busiest->load_per_task. busiest->load_per_task gets assigned the value
of sds->avg_load. If the calculated imbalance (env->imbalance) is less
than the original busiest->load_per_task, but more than sds->avg_load,
we do not hit fix_small_imbalance(). If we do not have tasks which can
migrate based on their load value compared to the current
env->imbalance, we hit the issue we've described above. The code path
has affected the behavior of the scheduler in a way which is not
expected, therefore it has a negative effect.
So, in all of these scenarios but one, the group_imbalance code path of
calculate_imbalance() does nothing. In the remaining one scenario, the
code path prevents the scheduler from doing its job (assigning work to
idle cores), so it harms the system.
Thus we conclude that the code has no benefit, and should be removed.
Considered from a different perspective, busiest->load_per_task, in the
current algorithm, represents actual load that can be moved around.
Setting it to a global average breaks this invariant and causes a faulty
decision at the fix_small_imbalance() path.
In summary, our system is harmed by the code branch, and our analysis
indicates the branch has no benefit to the scheduler as it sits today,
we feel justified in removing it.
--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.