Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains

From: Valentin Schneider
Date: Thu Dec 19 2019 - 10:16:32 EST


On 19/12/2019 14:45, Vincent Guittot wrote:
>> @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> env->migration_type = migrate_task;
>> lsub_positive(&nr_diff, local->sum_nr_running);
>> env->imbalance = nr_diff >> 1;
>> - return;
>> + goto out_spare;
>
> Why are you doing this only for prefer_sibling case ? That's probably the default case of most of numa system but you should also consider others case too.
>

I got confused by that as well but it's not just prefer_sibling actually;
there are cases where we enter the group_has_spare but none of its
nested if blocks, so we fall through to out_spare.

>> @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> env->migration_type = migrate_task;
>> env->imbalance = max_t(long, 0, (local->idle_cpus -
>> busiest->idle_cpus) >> 1);
>> +
>> +out_spare:
>> + /*
>> + * Whether balancing the number of running tasks or the number
>> + * of idle CPUs, consider allowing some degree of imbalance if
>> + * migrating between NUMA domains.
>> + */
>> + if (env->sd->flags & SD_NUMA) {
>> + unsigned int imbalance_adj, imbalance_max;
>> +
>> + /*
>> + * imbalance_adj is the allowable degree of imbalance
>> + * to exist between two NUMA domains. It's calculated
>> + * relative to imbalance_pct with a minimum of two
>> + * tasks or idle CPUs.
>> + */
>> + imbalance_adj = (busiest->group_weight *
>> + (env->sd->imbalance_pct - 100) / 100) >> 1;
>> + imbalance_adj = max(imbalance_adj, 2U);
>> +
>> + /*
>> + * Ignore imbalance unless busiest sd is close to 50%
>> + * utilisation. At that point balancing for memory
>> + * bandwidth and potentially avoiding unnecessary use
>> + * of HT siblings is as relevant as memory locality.
>> + */
>> + imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
>> + if (env->imbalance <= imbalance_adj &&
>> + busiest->sum_nr_running < imbalance_max) {i
>
> Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
>

I think it's better to hinge the cutoff on the busiest->sum_nr_running than
on busiest->idle_cpus. If you're balancing between big NUMA groups, you
could end up with a busiest->group_type == group_has_spare despite having
*some* of its CPUs overloaded (but still with
sg->sum_nr_running > sg->group_weight; simply because there's tons of CPUs).

> and you could simplify byÂ
>
>
> if ((env->sd->flags & SD_NUMA) &&
> ((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
> env->imbalance = 0;
> return;
> }
>
> And otherwise it will continue with the current path
>
> Also I'm a bit worry about using a 50% threshold that look a bit like a
> heuristic which can change depending of platform and the UCs that run of the
> system.
>
> In fact i was hoping that we could use the numa_preferred_nid ? During the
> detach of tasks, we don't detach the task if busiest has spare capacity and
> preferred_nid of the task is busiest.
>
> I'm going to run some tests to see the impact on my platform
>
> Regards,
> Vincent
> }
>
>
>> + env->imbalance = 0;
>> + }
>> + }
>> return;
>> }
>>
>>
>> --
>> Mel Gorman
>> SUSE Labs