Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains

From: Valentin Schneider
Date: Thu Dec 19 2019 - 06:46:14 EST


On 19/12/2019 10:02, Peter Zijlstra wrote:
> On Wed, Dec 18, 2019 at 06:50:52PM +0000, Valentin Schneider wrote:
>> I'm quite sure you have reasons to have written it that way, but I was
>> hoping we could squash it down to something like:
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 08a233e97a01..f05d09a8452e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> env->migration_type = migrate_task;
>> lsub_positive(&nr_diff, local->sum_nr_running);
>> env->imbalance = nr_diff >> 1;
>> - return;
>> + } else {
>> +
>> + /*
>> + * If there is no overload, we just want to even the number of
>> + * idle cpus.
>> + */
>> + env->migration_type = migrate_task;
>> + env->imbalance = max_t(long, 0, (local->idle_cpus -
>> + busiest->idle_cpus) >> 1);
>> }
>>
>> /*
>> - * If there is no overload, we just want to even the number of
>> - * idle cpus.
>> + * Allow for a small imbalance between NUMA groups; don't do any
>> + * of it if there is at least half as many tasks / busy CPUs as
>> + * there are available CPUs in the busiest group
>> */
>> - env->migration_type = migrate_task;
>> - env->imbalance = max_t(long, 0, (local->idle_cpus -
>> - busiest->idle_cpus) >> 1);
>> + if (env->sd->flags & SD_NUMA &&
>> + (busiest->sum_nr_running < busiest->group_weight >> 1) &&
>> + (env->imbalance < busiest->group_weight * (env->sd->imbalance_pct - 100) / 100))
>
> Note that this form allows avoiding the division. Every time I see that
> /100 I'm thinking we should rename and make imbalance_pct a base-2
> thing.
>

Right, I kept the original form but we can turn that into

env->imbalance * 100 < busiest->group_weight * (env->sd->imbalance_pct - 100)
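
For what it's worth, a quick userspace sketch of the two forms side by side. The helper names are made up for illustration and this is not kernel code. The forms aren't bit-for-bit identical: the /100 truncates, so the division-free check is slightly more permissive whenever group_weight * (imbalance_pct - 100) isn't a multiple of 100.

```c
#include <stdbool.h>

/* Hypothetical helper: the form from the patch, with the integer division. */
static bool allow_imbalance_div(long imbalance, unsigned int group_weight,
				unsigned int imbalance_pct)
{
	return imbalance < (long)(group_weight * (imbalance_pct - 100) / 100);
}

/* Hypothetical helper: same check with the division multiplied out. */
static bool allow_imbalance_mul(long imbalance, unsigned int group_weight,
				unsigned int imbalance_pct)
{
	return imbalance * 100 < (long)(group_weight * (imbalance_pct - 100));
}
```

With group_weight = 6 and imbalance_pct = 125, an imbalance of 1 is rejected by the first form (1 < 1) but accepted by the second (100 < 150). I doubt that matters in practice.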



As for the base-2 imbalance, I think you've mentioned that in the past.
Looking at check_cpu_capacity() as a typical imbalance_pct user, we could
turn that from:

rq->cpu_capacity * sd->imbalance_pct < rq->cpu_capacity_orig * 100

to:

rq->cpu_capacity_orig - rq->cpu_capacity > rq->cpu_capacity_orig >> sd->imbalance_shift
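
To make the direction of the comparison explicit, here's a standalone sketch of both checks. The function names and the imbalance_shift parameter are hypothetical; nothing like that exists in the kernel today. The percentage check fires when capacity has dropped by more than some fraction of the original, so the shift version compares the lost capacity with '>'.

```c
#include <stdbool.h>

/* Percentage form, mirroring the expression check_cpu_capacity() uses. */
static bool reduced_capacity_pct(unsigned long cap, unsigned long cap_orig,
				 unsigned int imbalance_pct)
{
	return cap * imbalance_pct < cap_orig * 100;
}

/*
 * Hypothetical shift form. Assumes cap <= cap_orig, otherwise the
 * unsigned subtraction wraps around.
 */
static bool reduced_capacity_shift(unsigned long cap, unsigned long cap_orig,
				   unsigned int imbalance_shift)
{
	return cap_orig - cap > (cap_orig >> imbalance_shift);
}
```

With cap_orig = 1024, imbalance_pct = 125 trips below roughly 80% of the original capacity (cap <= 819), whereas a shift of 2 trips below 75% (cap < 768), so the two only agree approximately.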


And here we could just go with

env->imbalance < busiest->group_weight >> env->sd->imbalance_shift


As for picking values, right now we have

125 (default) / 117 (LLC domain) / 110 (SMT domain)

We could have

>> 2 (25%), >> 3 (12.5%), >> 4 (6.25%).

It's not strictly equivalent (17% becomes 12.5% and 10% becomes 6.25%), but
IMO the whole imbalance_pct thing isn't very precise anyway; it just needs to
be good enough on a sufficient number of topologies.
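
To get a feel for how far apart the two schemes end up, here's a small sketch computing the allowed imbalance for a given group weight under each. Both helpers are made up for this mail, and the pct-to-shift pairing is just the one suggested above.

```c
/* Hypothetical: threshold implied by the current percentage scheme. */
static unsigned int imbalance_thresh_pct(unsigned int group_weight,
					 unsigned int imbalance_pct)
{
	return group_weight * (imbalance_pct - 100) / 100;
}

/* Hypothetical: threshold implied by a base-2 shift. */
static unsigned int imbalance_thresh_shift(unsigned int group_weight,
					   unsigned int imbalance_shift)
{
	return group_weight >> imbalance_shift;
}
```

For a 16-CPU group the pairs 125 / >> 2, 117 / >> 3 and 110 / >> 4 give the same thresholds (4, 2 and 1). For a 64-CPU group they drift apart for the smaller values: 16 vs 16, 10 vs 8 and 6 vs 4.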



>> + env->imbalance = 0;
>> +
>> return;
>> }
>>