Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains

From: Valentin Schneider
Date: Wed Dec 18 2019 - 13:54:48 EST


Hi Mel,

On 18/12/2019 15:44, Mel Gorman wrote:
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the
> tasks' data locality and the wakeup/LB paths continually conflict. NUMA
> balancing is also a factor but it simply conflicts with the load
> balancer as well.
>
> This patch allows a degree of imbalance to exist between NUMA domains
> based on the imbalance_pct defined by the scheduler domain to take into
> account that data locality is also important. This slight imbalance is
> allowed until the scheduler domain reaches almost 50% utilisation, at which
> point other factors like HT utilisation and memory bandwidth come into
> play. While not commented upon in the code, the cutoff is important for
> memory-bound parallelised non-communicating workloads that do not fully
> utilise the entire machine. This is not necessarily the best universal
> cut-off point but it appeared appropriate for a variety of workloads
> and machines.
>
> The most obvious impact is on netperf TCP_STREAM -- two simple
> communicating tasks with some softirq offloaded depending on the
> transmission rate.
>

<snip>

> In general, the patch simply seeks to avoid unnecessary cross-node
> migrations when a machine is lightly loaded, but it also shows benefits
> for other workloads. While tests are still running, so far it seems to
> benefit smaller, lightly utilised workloads on large machines and does
> not appear to do any harm to larger or parallelised workloads.
>

Thanks for the detailed testing. I haven't digested it entirely yet, but I
appreciate the effort.

> @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> 		env->migration_type = migrate_task;
> 		env->imbalance = max_t(long, 0, (local->idle_cpus -
> 						 busiest->idle_cpus) >> 1);
> +
> +out_spare:
> +		/*
> +		 * Whether balancing the number of running tasks or the number
> +		 * of idle CPUs, consider allowing some degree of imbalance if
> +		 * migrating between NUMA domains.
> +		 */
> +		if (env->sd->flags & SD_NUMA) {
> +			unsigned int imbalance_adj, imbalance_max;
> +
> +			/*
> +			 * imbalance_adj is the allowable degree of imbalance
> +			 * to exist between two NUMA domains. It's calculated
> +			 * relative to imbalance_pct with a minimum of two
> +			 * tasks or idle CPUs.
> +			 */
> +			imbalance_adj = (busiest->group_weight *
> +				(env->sd->imbalance_pct - 100) / 100) >> 1;

IIRC imbalance_pct for NUMA domains uses the default 125, so I read this as
"allow an imbalance of 1 task per 8 CPUs in the source group" (just making
sure I follow).
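
To spell the arithmetic out, here's a throwaway userspace snippet (the 16-CPU
source group is just an assumed example, none of this is kernel code):

#include <stdio.h>

int main(void)
{
	/* Assumed example: 16-CPU source group, default NUMA imbalance_pct */
	unsigned int group_weight = 16;
	unsigned int imbalance_pct = 125;
	unsigned int imbalance_adj;

	/* Same expression as the patch: (16 * 25 / 100) >> 1 */
	imbalance_adj = (group_weight * (imbalance_pct - 100) / 100) >> 1;
	printf("imbalance_adj = %u\n", imbalance_adj);

	return 0;
}

which prints 2, i.e. one task of slack per 8 CPUs in that group.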

> +			imbalance_adj = max(imbalance_adj, 2U);
> +
> +			/*
> +			 * Ignore imbalance unless busiest sd is close to 50%
> +			 * utilisation. At that point balancing for memory
> +			 * bandwidth and potentially avoiding unnecessary use
> +			 * of HT siblings is as relevant as memory locality.
> +			 */
> +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> +			if (env->imbalance <= imbalance_adj &&
> +			    busiest->sum_nr_running < imbalance_max) {

The code does "unless the busiest group has at least half as many runnable
tasks as it has CPUs (modulo the adj thing)". Is that what you mean by
"unless busiest sd is close to 50% utilisation" in the comment? It's
somewhat different IMO.
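
If my arithmetic is right, with the same assumed numbers as before (16-CPU
busiest group, imbalance_pct == 125):

	imbalance_adj = max((16 * 25 / 100) >> 1, 2) = 2
	imbalance_max = (16 >> 1) - imbalance_adj    = 6

so we stop ignoring the imbalance as soon as busiest->sum_nr_running reaches
6, i.e. at 37.5% of the group's CPUs, which is a fair bit short of 50%.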

> +				env->imbalance = 0;
> +			}
> +		}
> 		return;
> 	}
>
>

I'm quite sure you have reasons to have written it that way, but I was
hoping we could squash it down to something like:
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..f05d09a8452e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
+		} else {
+
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
+							 busiest->idle_cpus) >> 1);
 		}
 
 		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
+		 * Allow for a small imbalance between NUMA groups; don't do any
+		 * of it if there are at least half as many tasks / busy CPUs as
+		 * there are available CPUs in the busiest group
 		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
-						 busiest->idle_cpus) >> 1);
+		if (env->sd->flags & SD_NUMA &&
+		    (busiest->sum_nr_running < busiest->group_weight >> 1) &&
+		    (env->imbalance < busiest->group_weight * (env->sd->imbalance_pct - 100) / 100))
+			env->imbalance = 0;
+
 		return;
 	}
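
FWIW, to eyeball where that condition flips I hacked together the throwaway
userspace sketch below (ignore_numa_imbalance() is a made-up name and the
inputs are invented; it only models the arithmetic, not the SD_NUMA flag
test):

#include <stdio.h>
#include <stdbool.h>

/* Userspace model of the proposed check, minus the SD_NUMA flag test. */
static bool ignore_numa_imbalance(unsigned int sum_nr_running,
				  unsigned int group_weight,
				  unsigned int imbalance_pct,
				  long imbalance)
{
	return sum_nr_running < (group_weight >> 1) &&
	       imbalance < group_weight * (imbalance_pct - 100) / 100;
}

int main(void)
{
	unsigned int nr;

	/* Assumed: 16-CPU busiest group, imbalance_pct == 125, imbalance == 2 */
	for (nr = 0; nr <= 10; nr++)
		printf("sum_nr_running=%2u -> %s\n", nr,
		       ignore_numa_imbalance(nr, 16, 125, 2) ?
		       "imbalance ignored" : "balance as usual");

	return 0;
}

With those made-up numbers the imbalance stops being ignored once the busiest
group hits 8 running tasks, i.e. right at the 50% point rather than the
~37.5% the original version lands on.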