Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
From: Brendan Jackman
Date: Tue Aug 22 2017 - 06:45:44 EST
On Tue, Aug 22 2017 at 10:39, Brendan Jackman wrote:
> On Tue, Aug 22 2017 at 04:34, Joel Fernandes wrote:
>> Hi Peter,
>>
>> On Mon, Aug 21, 2017 at 2:14 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>> On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
>>>> The current use of returning NULL from find_idlest_group is broken in
>>>> two cases:
>>>>
>>>> a1) The local group is not allowed.
>>>>
>>>> In this case, we currently do not change this_runnable_load or
>>>> this_avg_load from their initial values of 0, which means we return
>>>> NULL regardless of the load of the other, allowed groups. This
>>>> results in pointlessly continuing the find_idlest_group search
>>>> within the local group and then returning prev_cpu from
>>>> select_task_rq_fair.
>>>
>>>> b) smp_processor_id() is the "idlest" and != prev_cpu.
>>>>
>>>> find_idlest_group also returns NULL when the local group is
>>>> allowed and is the idlest. The caller then continues the
>>>> find_idlest_group search at a lower level of the current CPU's
>>>> sched_domain hierarchy. However new_cpu is not updated. This means
>>>> the search is pointless and we return prev_cpu from
>>>> select_task_rq_fair.
>>>>
>>>
>>> I think it's much simpler than that... but it's late, so who knows ;-)
>>>
>>> Both cases seem predicated on the assumption that we'll return @cpu when
>>> we don't find any idler CPU. Consider, if the local group is the idlest,
>>> we should stick with @cpu and simply proceed with the child domain.
>>>
>>> The confusion, and the bugs, seem to have snuck in when we started
>>> considering @prev_cpu, whenever that was. The below is mostly code
>>> movement to put that whole while(sd) loop into its own function.
>>>
>>> The effective change is setting @new_cpu = @cpu when we start that loop:
>>>
>> <snip>
>>> ---
>>> kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++----------------------
>>> 1 file changed, 48 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index c77e4b1d51c0..3e77265c480a 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5588,10 +5588,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>> }
>>>
>>> /*
>>> - * find_idlest_cpu - find the idlest cpu among the cpus in group.
>>> + * find_idlest_group_cpu - find the idlest cpu among the cpus in group.
>>> */
>>> static int
>>> -find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>> +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>> {
>>> unsigned long load, min_load = ULONG_MAX;
>>> unsigned int min_exit_latency = UINT_MAX;
>>> @@ -5640,6 +5640,50 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>> return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
>>> }
>>>
>>> +static int
>>> +find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
>>> +{
>>> + struct sched_domain *tmp;
>>> + int new_cpu = cpu;
>>> +
>>> + while (sd) {
>>> + struct sched_group *group;
>>> + int weight;
>>> +
>>> + if (!(sd->flags & sd_flag)) {
>>> + sd = sd->child;
>>> + continue;
>>> + }
>>> +
>>> + group = find_idlest_group(sd, p, cpu, sd_flag);
>>> + if (!group) {
>>> + sd = sd->child;
>>> + continue;
>>
>> But this will still have the issue of pointlessly searching in the
>> local_group when the idlest CPU is in the non-local group? Stemming
>> from the fact that find_idlest_group is broken if the local group is
>> not allowed.
>>
>> I believe this is fixed by Brendan's patch? :
>>
>> "Initializing this_runnable_load and this_avg_load to ULONG_MAX
>> instead of 0. This means in case a1) we now return the idlest
>> non-local group."
>>
>
> Yeah, I don't think this is enough, firstly because of affinity and
> secondly because I _guess_ we don't want to migrate from prev_cpu to
> @cpu if none of the sched_domains have sd_flag set (...right? That would
> feel like wake balancing even when there's no SD_BALANCE_WAKE anywhere).
Hmm, actually we always have sd_flag set, because otherwise we wouldn't
have done sd = tmp at the beginning of select_task_rq_fair.
>
> However the code movement helps - I'll combine it with Vincent's
> suggestions and post a v2.