Re: [PATCH 3/4] sched/numa: Allow a floating imbalance between NUMA nodes

From: Vincent Guittot
Date: Fri Nov 20 2020 - 08:33:36 EST


On Fri, 20 Nov 2020 at 10:06, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> Currently, an imbalance is only allowed when a destination node
> is almost completely idle. This solved one basic class of problems
> and was the cautious approach.
>
> This patch revisits the possibility that NUMA nodes can be imbalanced
> until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat
> superficial -- it's half the cores when HT is enabled. At higher
> utilisations, balancing should continue as normal and keep things even
> until scheduler domains are fully busy or overutilised.
>
> Note that this is not expected to be a universal win. Any benchmark
> that prefers spreading as wide as possible with limited communication
> will favour the old behaviour as there is more memory bandwidth.
> Workloads that communicate heavily in pairs such as netperf or tbench
> benefit. For the tests I ran, the vast majority of workloads saw
> a benefit so it seems to be a worthwhile trade-off.
>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

Reviewed-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>

> ---
> kernel/sched/fair.c | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9aded12aaa90..e17e6c5da1d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1550,7 +1550,8 @@ struct task_numa_env {
> static unsigned long cpu_load(struct rq *rq);
> static unsigned long cpu_runnable(struct rq *rq);
> static unsigned long cpu_util(int cpu);
> -static inline long adjust_numa_imbalance(int imbalance, int dst_running);
> +static inline long adjust_numa_imbalance(int imbalance,
> + int dst_running, int dst_weight);
>
> static inline enum
> numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1930,7 +1931,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> src_running = env->src_stats.nr_running - 1;
> dst_running = env->dst_stats.nr_running + 1;
> imbalance = max(0, dst_running - src_running);
> - imbalance = adjust_numa_imbalance(imbalance, dst_running);
> + imbalance = adjust_numa_imbalance(imbalance, dst_running,
> + env->dst_stats.weight);
>
> /* Use idle CPU if there is no imbalance */
> if (!imbalance) {
> @@ -8995,16 +8997,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> #define NUMA_IMBALANCE_MIN 2
>
> -static inline long adjust_numa_imbalance(int imbalance, int dst_running)
> +static inline long adjust_numa_imbalance(int imbalance,
> + int dst_running, int dst_weight)
> {
> - unsigned int imbalance_min;
> -
> /*
> * Allow a small imbalance based on a simple pair of communicating
> - * tasks that remain local when the source domain is almost idle.
> + * tasks that remain local when the destination is lightly loaded.
> */
> - imbalance_min = NUMA_IMBALANCE_MIN;
> - if (dst_running <= imbalance_min)
> + if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
> return 0;
>
> return imbalance;
> @@ -9107,9 +9107,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> }
>
> /* Consider allowing a small imbalance between NUMA groups */
> - if (env->sd->flags & SD_NUMA)
> + if (env->sd->flags & SD_NUMA) {
> env->imbalance = adjust_numa_imbalance(env->imbalance,
> - busiest->sum_nr_running);
> + busiest->sum_nr_running, busiest->group_weight);
> + }
>
> return;
> }
> --
> 2.26.2
>
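
For anyone wanting to see how the threshold in the hunk above behaves, here is a minimal userspace sketch. It copies the patched adjust_numa_imbalance() next to the pre-patch version so the two can be compared. The name adjust_numa_imbalance_old() and the 48-CPU node used in the examples are assumptions made purely for illustration; neither comes from the patch.

```c
#define NUMA_IMBALANCE_MIN 2

/*
 * Pre-patch behaviour (the name is hypothetical): any imbalance is
 * ignored while the destination runs at most NUMA_IMBALANCE_MIN tasks.
 */
static long adjust_numa_imbalance_old(int imbalance, int dst_running)
{
	unsigned int imbalance_min = NUMA_IMBALANCE_MIN;

	if (dst_running <= imbalance_min)
		return 0;

	return imbalance;
}

/*
 * Patched behaviour, copied from the diff: an imbalance of up to
 * NUMA_IMBALANCE_MIN is ignored while fewer than 25% of the
 * destination's CPUs (dst_weight >> 2) have a running task.
 */
static long adjust_numa_imbalance(int imbalance,
				  int dst_running, int dst_weight)
{
	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
		return 0;

	return imbalance;
}
```

With an assumed dst_weight of 48 the cut-off is 12 running tasks: adjust_numa_imbalance(2, 5, 48) returns 0, whereas adjust_numa_imbalance(2, 12, 48) returns 2. The new check also bounds the size of the imbalance, so adjust_numa_imbalance(4, 2, 48) returns 4 where the old helper returned 0 for the same inputs.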