Re: [PATCH] sched/fair: Add SMT4 group_smt_balance handling

From: Shrikanth Hegde
Date: Mon Aug 07 2023 - 05:36:48 EST




On 7/27/23 7:02 PM, Tim Chen wrote:
> On Wed, 2023-07-26 at 20:11 -0700, Tim Chen wrote:
>> On Mon, 2023-07-17 at 20:28 +0530, Shrikanth Hegde wrote:
>>> From: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>>>
>>> For SMT4, any group with more than 2 tasks will be marked as
>>> group_smt_balance. Retain the behaviour of group_has_spare by marking
>>> the busiest group as the group which has the least number of idle_cpus.
>>>
>>> Also, handle rounding effect of adding (ncores_local + ncores_busy)
>>> when the local is fully idle and busy group has more than 2 tasks.
>>> Local group should try to pull at least 1 task in this case.
>>>
>>> Originally-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>>> Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>>> Signed-off-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>
>>> ---
>>> kernel/sched/fair.c | 13 ++++++++++++-
>>> 1 file changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 932e7b78894a..9502013abe33 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -9532,7 +9532,7 @@ static inline long sibling_imbalance(struct lb_env *env,
>>> imbalance /= ncores_local + ncores_busiest;
>>>
>>> /* Take advantage of resource in an empty sched group */
>>> - if (imbalance == 0 && local->sum_nr_running == 0 &&
>>> + if (imbalance <= 1 && local->sum_nr_running == 0 &&
>>> busiest->sum_nr_running > 1)
>>> imbalance = 2;
>>>
>>> @@ -9720,6 +9720,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>> break;
>>>
>>> case group_smt_balance:
>>> + /* no idle cpus on both groups handled by group_fully_busy below */
>>> + if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
>>> + if (sgs->idle_cpus > busiest->idle_cpus)
>>> + return false;
>>> + if (sgs->idle_cpus < busiest->idle_cpus)
>>> + return true;
>>> + if (sgs->sum_nr_running <= busiest->sum_nr_running)
>>> + return false;
>>> + }
>>> + break;
>
> Shrikanth and Peter,
>
> Sorry, I acked Shrikanth's fixup patch too quickly, without seeing that Shrikanth added
> a "break" in the patch above. My original code did not have that break statement, as
> I intended the code to fall through to the "group_fully_busy" code path when
> there are no idle CPUs in either group. To keep the compiler happy and put in
> the correct logic, I have refreshed the patch below.
>
> Thanks.
>
> Tim
>
> From: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> Date: Fri, 14 Jul 2023 16:09:30 -0700
> Subject: [PATCH] sched/fair: Add SMT4 group_smt_balance handling
>
> For SMT4, any group with more than 2 tasks will be marked as
> group_smt_balance. Retain the behaviour of group_has_spare by marking
> the busiest group as the group which has the least number of idle_cpus.
>
> Also, handle rounding effect of adding (ncores_local + ncores_busy)
> when the local is fully idle and busy group has more than 2 tasks.
> Local group should try to pull at least 1 task in this case.
>
> Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a87988327f88..566686c5f2bd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9563,7 +9563,7 @@ static inline long sibling_imbalance(struct lb_env *env,
> imbalance /= ncores_local + ncores_busiest;
>
> /* Take advantage of resource in an empty sched group */
> - if (imbalance == 0 && local->sum_nr_running == 0 &&
> + if (imbalance <= 1 && local->sum_nr_running == 0 &&
> busiest->sum_nr_running > 1)
> imbalance = 2;
>
> @@ -9751,6 +9751,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> break;
>
> case group_smt_balance:
> + /* no idle cpus on both groups handled by group_fully_busy below */
> + if (sgs->idle_cpus != 0 || busiest->idle_cpus != 0) {
> + if (sgs->idle_cpus > busiest->idle_cpus)
> + return false;
> + if (sgs->idle_cpus < busiest->idle_cpus)
> + return true;
> + if (sgs->sum_nr_running <= busiest->sum_nr_running)
> + return false;
> + else
> + return true;
> + }
> + goto fully_busy;
> + break;
> +
> case group_fully_busy:
> /*
> * Select the fully busy group with highest avg_load. In
> @@ -9763,7 +9777,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * select the 1st one, except if @sg is composed of SMT
> * siblings.
> */
> -
> +fully_busy:
> if (sgs->avg_load < busiest->avg_load)
> return false;
>

Hi Tim, Peter.

The group_smt_balance (cluster scheduling) patches are in tip/sched/core, but I don't
see the above patch there yet. As it stands, this can cause a functional difference
on SMT4 systems (such as Power10).
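
The sibling_imbalance() part is easiest to see with a small standalone sketch of the
normalization arithmetic. The core and task counts below are just illustrative values
I picked (asymmetric groups, local group fully idle), not numbers from the patch:

#include <stdio.h>

int main(void)
{
	/* illustrative values: asymmetric groups, local group fully idle */
	long ncores_local = 1, ncores_busiest = 4;
	long busiest_nr_running = 2, local_nr_running = 0;
	long imbalance;

	/* balance such that the nr_running/ncores ratio matches on both groups */
	imbalance = ncores_local * busiest_nr_running -
		    ncores_busiest * local_nr_running;			/* = 2 */

	/* normalize, rounding to nearest by adding (ncores_local + ncores_busiest) */
	imbalance = 2 * imbalance + ncores_local + ncores_busiest;	/* = 9 */
	imbalance /= ncores_local + ncores_busiest;			/* 9/5 = 1 */

	/*
	 * With the rounding term the result here is always >= 1, so the old
	 * "imbalance == 0" empty-group check could never fire; relaxing it
	 * to "imbalance <= 1" bumps the idle local group to an imbalance of
	 * 2 so it pulls at least one task.
	 */
	if (imbalance <= 1 && local_nr_running == 0 && busiest_nr_running > 1)
		imbalance = 2;

	printf("imbalance = %ld\n", imbalance);		/* prints: imbalance = 2 */
	return 0;
}

If I am reading calculate_imbalance() right, the value returned by sibling_imbalance()
is halved there, so a return of 1 would leave the fully idle local group pulling
nothing, while a return of 2 lets it pull one task.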

Can we please have the above patch in tip/sched/core as well?

Acked-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>