Re: [PATCH] sched/fair: optimize should_we_balance for higher SMT systems

From: Shrikanth Hegde
Date: Tue Sep 05 2023 - 22:41:41 EST




On 9/6/23 1:00 AM, Tim Chen wrote:
> On Sat, 2023-09-02 at 13:42 +0530, Shrikanth Hegde wrote:
>>
>>
>> Fixes: b1bfeab9b002 ("sched/fair: Consider the idle state of the whole core for load balance")
>> Signed-off-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>
>> ---
>> kernel/sched/fair.c | 15 ++++++++++++++-
>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0b7445cd5af9..6e31923293bb 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6619,6 +6619,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> /* Working cpumask for: load_balance, load_balance_newidle. */
>> static DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
>> static DEFINE_PER_CPU(cpumask_var_t, select_rq_mask);
>> +static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
>>
>> #ifdef CONFIG_NO_HZ_COMMON
>>
>> @@ -10913,6 +10914,7 @@ static int active_load_balance_cpu_stop(void *data);
>>
>> static int should_we_balance(struct lb_env *env)
>> {
>> + struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(should_we_balance_tmpmask);
>> struct sched_group *sg = env->sd->groups;
>> int cpu, idle_smt = -1;
>>
>> @@ -10936,8 +10938,9 @@ static int should_we_balance(struct lb_env *env)
>> return 1;
>> }
>>
>> + cpumask_copy(swb_cpus, group_balance_mask(sg));
>> /* Try to find first idle CPU */
>> - for_each_cpu_and(cpu, group_balance_mask(sg), env->cpus) {
>> + for_each_cpu_and(cpu, swb_cpus, env->cpus) {
>> if (!idle_cpu(cpu))
>> continue;
>>
>> @@ -10949,6 +10952,14 @@ static int should_we_balance(struct lb_env *env)
>> if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
>> if (idle_smt == -1)
>> idle_smt = cpu;
>> + /*
>> + * If the core is not idle, and first SMT sibling which is
>> + * idle has been found, then its not needed to check other
>> + * SMT siblings for idleness
>> + */
>> +#ifdef CONFIG_SCHED_SMT
>> + cpumask_andnot(swb_cpus, swb_cpus, cpu_smt_mask(cpu));
>> +#endif
>> continue;
>> }
>>
>> @@ -12914,6 +12925,8 @@ __init void init_sched_fair_class(void)
>> for_each_possible_cpu(i) {
>> zalloc_cpumask_var_node(&per_cpu(load_balance_mask, i), GFP_KERNEL, cpu_to_node(i));
>> zalloc_cpumask_var_node(&per_cpu(select_rq_mask, i), GFP_KERNEL, cpu_to_node(i));
>> + zalloc_cpumask_var_node(&per_cpu(should_we_balance_tmpmask, i),
>> + GFP_KERNEL, cpu_to_node(i));
>
> Shrianth,
>

Hi Tim,

Thanks for taking a look at this patch.

> Wonder if we can avoid allocating the
> should_we_balance_tmpmask for SMT2 case to save memory
> for system with large number of cores.
>
> The new mask and logic I think is only needed for more than 2 threads in a core.

Code would have to be refactored quite a bit if one needs to take
different approach for specific SMT setting.

I think there would some cases in SMT2 that will benefit as well.
Lets say 1 cpu in each core is busy. the busy CPU happens to be second
CPU in the core. In that case, this approach would skip that instead of
checking if that is idle or not.

>
> Tim
>>
>> #ifdef CONFIG_CFS_BANDWIDTH
>> INIT_CSD(&cpu_rq(i)->cfsb_csd, __cfsb_csd_unthrottle, cpu_rq(i));
>> --
>> 2.31.1
>>
>