Re: [PATCH v2 5/5] sched/fair: Fix use of find_idlest_group when local group is idlest.

From: Vincent Guittot
Date: Mon Aug 28 2017 - 04:57:00 EST


On 25 August 2017 at 17:51, Brendan Jackman <brendan.jackman@xxxxxxx> wrote:
>
> On Fri, Aug 25 2017 at 13:38, Vincent Guittot wrote:
>> On 25 August 2017 at 12:16, Brendan Jackman <brendan.jackman@xxxxxxx> wrote:
>>> find_idlest_group currently returns NULL when the local group is
>>> idlest. The caller then continues the find_idlest_group search at a
>>> lower level of the current CPU's sched_domain hierarchy. However
>>> find_idlest_group_cpu is not consulted and, crucially, @new_cpu is
>>> not updated. This means the search is pointless and we return
>>> @prev_cpu from select_task_rq_fair.
>>>
>>> This patch makes find_idlest_group return the idlest group, and never
>>> NULL, and then has the caller unconditionally continue to consult
>>> find_idlest_group_cpu and update @new_cpu.
>>>
>>> * * *
>>>
>>> An alternative, simpler, fix for this would be to initialize @new_cpu
>>> to @cpu instead of @prev_cpu, at the beginning of the new
>>> find_idlest_cpu. However this would not fix the fact that
>>
>> Yes this is simpler
>>
>>> find_idlest_group_cpu goes unused, and we miss out on the bias toward
>>> a shallower-idle CPU, unless find_idlest_group picks a non-local
>>> group.
>>
>> I'm not sure to catch why it's not enough.
>> If no cpu of sched_domain span is allowed, we continue to return prev_cpu
>> But if at least 1 cpu is allowed, the default new_cpu is cpu in case
>> it is really the idlest and no other group will be returned by
>> find_idlest_group
>> Otherwise, find_idlest_group will finally return another group than
>> the local one when running with sd->child
>
> This patch isn't about affinity but just about which group is idlest.

So this patch changes 2 things:
- don't return @prev if @cpu is the idlest one
- always call find_idlest_group_cpu even for local group

The 1st point is a bug but the 2nd one is more a matter of policy.
Current implementation favors locality and load/capacity over current
idle state of CPU which is only used at a last resort to select a cpu.
What you are proposing, it's to increase the weight of the current
idle state of cpus in the selection of the cpu in order to become
sometime higher than the locality of even the load/capacity

>
> Consider a 4 CPU system with 2 shared caches where CPU0's sched_domains
> look like:
>
> DIE: [0 1][2 3]
> MC: [0][1]
>
> Suppose CPUs 0 and 1 are equally idle wrt load and util, but CPU1 is in
> a shallower idle state. Suppose all CPUs are allowed.
>
> Assuming @cpu is 0, on the first iteration, find_idlest_group (fig) will
> currently return NULL meaning [0 1], then on the second iteration it
> will return NULL meaning [0].
>
> With this patch, on the first iteration fig will return [0 1], then
> find_idlest_group_cpu (figc) will return 1. Then on the second
> iteration, fig should return [1] (because it is biased towards the local
> group i.e. [@cpu] i.e. [1]). This is better because CPU1 is in a
> shallower idle state.
>
> So basically we're currently missing out on the biasing in figc, unless
> we at some point return a non-local group.
>
> Does that make sense?

With your previous example if the load/capacity of CPU0 is
significantly lower than CPU1, we will also select CPU1 instead of
CPU0:

@cpu is 0.
fig returns local group [0 1]
figc returns 1
@new_cpu != @cpu so we don't go to child

cpu = new cpu
sd = NULL
weight = 4
sd stays null because weight == tmp->span_weight (DIE)
we don't loop at lower level
fic returns CPU1

>
>>
>> Is there sched_domain topology where the sched_group of lowest level
>> is not with only 1 cpu ?
>>
>>>
>>> If that alternative solution was preferred, then some code could be
>>> removed in recognition of the fact that when find_idlest_group_cpu
>>> was called, @cpu would never be a member of @group (because @group
>>> would be a non-local group, and since @sd is derived from @cpu, @cpu
>>> would always be in @sd's local group).
>>>
>>> Signed-off-by: Brendan Jackman <brendan.jackman@xxxxxxx>
>>> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
>>> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>>> Cc: Josef Bacik <josef@xxxxxxxxxxxxxx>
>>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>>> Cc: Morten Rasmussen <morten.rasmussen@xxxxxxx>
>>> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>> ---
>>> kernel/sched/fair.c | 29 ++++++++++-------------------
>>> 1 file changed, 10 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 26080917ff8d..93e2746d3c30 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5384,11 +5384,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>> * Assumes p is allowed on at least one CPU in sd.
>>> */
>>> static struct sched_group *
>>> -find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>> - int this_cpu, int sd_flag)
>>> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag)
>>> {
>>> - struct sched_group *idlest = NULL, *group = sd->groups;
>>> - struct sched_group *most_spare_sg = NULL;
>>> + struct sched_group *group = sd->groups, *local_group = sd->groups;
>>> + struct sched_group *idlest = NULL, *most_spare_sg = NULL;
>>> unsigned long min_runnable_load = ULONG_MAX;
>>> unsigned long this_runnable_load = ULONG_MAX;
>>> unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;
>>> @@ -5404,7 +5403,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>> do {
>>> unsigned long load, avg_load, runnable_load;
>>> unsigned long spare_cap, max_spare_cap;
>>> - int local_group;
>>> int i;
>>>
>>> /* Skip over this group if it has no CPUs allowed */
>>> @@ -5412,9 +5410,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>> &p->cpus_allowed))
>>> continue;
>>>
>>> - local_group = cpumask_test_cpu(this_cpu,
>>> - sched_group_span(group));
>>> -
>>> /*
>>> * Tally up the load of all CPUs in the group and find
>>> * the group containing the CPU with most spare capacity.
>>> @@ -5425,7 +5420,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>>
>>> for_each_cpu(i, sched_group_span(group)) {
>>> /* Bias balancing toward cpus of our domain */
>>> - if (local_group)
>>> + if (group == local_group)
>>> load = source_load(i, load_idx);
>>> else
>>> load = target_load(i, load_idx);
>>> @@ -5446,7 +5441,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>> runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
>>> group->sgc->capacity;
>>>
>>> - if (local_group) {
>>> + if (group == local_group) {
>>> this_runnable_load = runnable_load;
>>> this_avg_load = avg_load;
>>> this_spare = max_spare_cap;
>>> @@ -5492,21 +5487,21 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>>
>>> if (this_spare > task_util(p) / 2 &&
>>> imbalance_scale*this_spare > 100*most_spare)
>>> - return NULL;
>>> + return local_group;
>>>
>>> if (most_spare > task_util(p) / 2)
>>> return most_spare_sg;
>>>
>>> skip_spare:
>>> if (!idlest)
>>> - return NULL;
>>> + return local_group;
>>>
>>> if (min_runnable_load > (this_runnable_load + imbalance))
>>> - return NULL;
>>> + return local_group;
>>>
>>> if ((this_runnable_load < (min_runnable_load + imbalance)) &&
>>> (100*this_avg_load < imbalance_scale*min_avg_load))
>>> - return NULL;
>>> + return local_group;
>>>
>>> return idlest;
>>> }
>>> @@ -5582,11 +5577,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>>> continue;
>>> }
>>>
>>> - group = find_idlest_group(sd, p, cpu, sd_flag);
>>> - if (!group) {
>>> - sd = sd->child;
>>> - continue;
>>> - }
>>> + group = find_idlest_group(sd, p, sd_flag);
>>>
>>> new_cpu = find_idlest_group_cpu(group, p, cpu);
>>> if (new_cpu == cpu) {
>>> --
>>> 2.14.1
>>>
>