Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling

From: Shrikanth Hegde
Date: Thu Sep 07 2023 - 11:39:27 EST




On 9/6/23 12:07 AM, Tim Chen wrote:
> On Tue, 2023-09-05 at 13:33 +0530, Shrikanth Hegde wrote:
>>
>> On 8/22/23 12:49 AM, Tim Chen wrote:
>>> On Mon, 2023-08-07 at 15:06 +0530, Shrikanth Hegde wrote:
>>>>>
>>>>> From: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>>>>> Date: Fri, 14 Jul 2023 16:09:30 -0700
>>>>> Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
>>>>>
>>>>> For SMT4, any group with more than 2 tasks will be marked as
>>>>> group_smt_balance. Retain the behaviour of group_has_spare by marking
>>>>> the busiest group as the group which has the least number of idle_cpus.
>>>>>
>>>>> Also, handle rounding effect of adding (ncores_local + ncores_busy)
>>>>> when the local is fully idle and busy group has more than 2 tasks.
>>>>> Local group should try to pull at least 1 task in this case.
>>>>>
>>>>> Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>>>>> ---
>>>>> kernel/sched/fair.c | 18 ++++++++++++++++--
>>>>> 1 file changed, 16 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index a87988327f88..566686c5f2bd 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>>>>> imbalance /= ncores_local + ncores_busiest;
>>>>>
>>>>> /* Take advantage of resource in an empty sched group */
>>>>> - if (imbalance == 0 && local->sum_nr_running == 0 &&
>>>>> + if (imbalance <= 1 && local->sum_nr_running == 0 &&
>>>>> busiest->sum_nr_running > 1)
>>>>> imbalance = 2;
>>>>>
>>>>> @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>>> break;
>>>>>
>>>>> case group_smt_balance:
>>>>> + /* no idle cpus on both groups handled by group_fully_busy below */
>>>>> + if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>>>>> + if (sgs->idle_cpus > busiest->idle_cpus)
>>>>> + return false;
>>>>> + if (sgs->idle_cpus < busiest->idle_cpus)
>>>>> + return true;
>>>>> + if (sgs->sum_nr_running <= busiest->sum_nr_running)
>>>>> + return false;
>>>>> + else
>>>>> + return true;
>>>>> + }
>>>>> + goto fully_busy;
>>>>> + break;
>>>>> +
>>>>> case group_fully_busy:
>>>>> /*
>>>>> * Select the fully busy group with highest avg_load. In
>>>>> @@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>>> * select the 1st one, except if @sg is composed of SMT
>>>>> * siblings.
>>>>> */
>>>>> -
>>>>> +fully_busy:
>>>>> if (sgs->avg_load < busiest->avg_load)
>>>>> return false;
>>>>>
>>>>
>>>> Hi Tim, Peter.
>>>>
>>>> group_smt_balance(cluster scheduling), patches are in tip/sched/core. I dont
>>>> see this above patch there yet. Currently as is, this can cause function difference
>>>> in SMT4 systems( such as Power10).
>>>>
>>>> Can we please have the above patch as well in tip/sched/core?
>>>>
>>>> Acked-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>
>>>
>>> Hi Peter,
>>>
>>> Just back from my long vacation. Wonder if you have any comments on the above patch
>>> for fixing the SMT4 case?
>>>
>>> Tim
>>
>> Hi Tim, Peter.
>>
>> are there any concerns with the above patch for fixing the SMT4 case.
>> Currently the behavior is group_smt_balance is set for having even 2 tasks in
>> SMT4, ideally it should be same as the group_has_spare.
>>
>> The above patch copies the same behavior to group_smt_balance.
>>>
>
> You mean simplify the patch as below? I think that should be fine. Can you
> make sure it works for SMT4? And I can update the patch once you confirm it
> works properly.
>
> Tim
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e7ee2efc1ba..48e9ab7f8a87 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9764,16 +9764,9 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>
> case group_smt_balance:
> /* no idle cpus on both groups handled by group_fully_busy below */
> - if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> - if (sgs->idle_cpus > busiest->idle_cpus)
> - return false;
> - if (sgs->idle_cpus < busiest->idle_cpus)
> - return true;
> - if (sgs->sum_nr_running <= busiest->sum_nr_running)
> - return false;
> - else
> - return true;
> - }
> + if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0)
> + goto has_spare;
> +
> goto fully_busy;
>
> case group_fully_busy:
> @@ -9809,6 +9802,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * as we do not want to pull task off SMT core with one task
> * and make the core idle.
> */
> +has_spare:
> if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
> if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
> return false;
>
>
>

Hi Tim,

In case you were waiting for my reply as inferred from other email.
The above change looks fine as well. This would avoid duplication of
code for group_smt_balance.

Acked-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>