Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Jirka Hladky
Date: Mon May 18 2020 - 10:53:11 EST


Hi Hillf,

thanks a lot for your patch!

Compared to 5.7 rc4 vanilla, we observe the following:
* Single-tenant jobs show improvements of up to 15% for SPECjbb2005 and
up to 100% for NAS in low-thread mode. In other words, it fixes all
the problems we have reported in this thread.
* Multi-tenant jobs show a performance degradation of up to 30% for SPECjbb2005.

While it fixes the problems with single-tenant jobs and with performance
at low system load, it hurts multi-tenant workloads.

We have compared it against a kernel with adjust_numa_imbalance disabled
[1]. Both kernels perform at the same level for the single-tenant
jobs, but the proposed patch is worse in the multi-tenant mode. The
kernel with adjust_numa_imbalance disabled is the clear winner here.

We would be very interested in what others think about disabling the
adjust_numa_imbalance function. The patch [1] is below. It would be great
to collect performance results for different scenarios to make sure
the results are objective.

Thanks a lot!
Jirka

[1] Patch to test kernel with adjust_numa_imbalance disabled:
===============================================
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b..8c43d29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8907,14 +8907,6 @@ static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
 {
	unsigned int imbalance_min;
 
-	/*
-	 * Allow a small imbalance based on a simple pair of communicating
-	 * tasks that remain local when the source domain is almost idle.
-	 */
-	imbalance_min = 2;
-	if (src_nr_running <= imbalance_min)
-		return 0;
-
	return imbalance;
 }
 
===============================================
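
If people would rather keep the heuristic but make it tunable (as Phil
suggests below for the hardcoded imbalance_min), one possible shape is
sketched here. This is only a sketch: sysctl_numa_imbalance_min is a
made-up name, not an existing kernel symbol, and it would still need to
be wired up to an actual sysctl or debugfs file. Setting it to 0 would
behave like the patch above; setting it to 2 keeps the current mainline
behavior.
===============================================
/*
 * Sketch only: make the allowed NUMA imbalance a runtime knob instead
 * of deleting the check. sysctl_numa_imbalance_min is a hypothetical
 * variable name, not an upstream symbol.
 */
unsigned int sysctl_numa_imbalance_min = 2;	/* current hardcoded value */

static inline long adjust_numa_imbalance(int imbalance, int src_nr_running)
{
	/*
	 * Allow a small imbalance based on a simple pair of communicating
	 * tasks that remain local when the source domain is almost idle.
	 */
	if (src_nr_running <= sysctl_numa_imbalance_min)
		return 0;

	return imbalance;
}
===============================================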

On Fri, May 8, 2020 at 5:47 AM Hillf Danton <hdanton@xxxxxxxx> wrote:
>
>
> On Thu, 7 May 2020 13:49:34 Phil Auld wrote:
> >
> > On Thu, May 07, 2020 at 06:29:44PM +0200 Jirka Hladky wrote:
> > > Hi Mel,
> > >
> > > we are not targeting just OMP applications. We see the performance
> > > degradation also for other workloads, like SPECjbb2005 and
> > > SPECjvm2008. Even worse, it also affects a higher number of threads.
> > > For example, comparing 5.7.0-0.rc2 against the 5.6 kernel on a 4-NUMA-node
> > > server with 2x AMD 7351 CPUs, we see a 22% performance degradation for 32
> > > threads (the system has 64 CPUs in total). We observe this degradation
> > > only when we run a single SPECjbb binary. When running 4 SPECjbb
> > > binaries in parallel, there is no change in performance between 5.6
> > > and 5.7.
> > >
> > > That's why we are asking for the kernel tunable, which we would add to
> > > the tuned profile. We don't expect users to change this frequently but
> > > rather to set the performance profile once based on the purpose of the
> > > server.
> > >
> > > If you could prepare a patch for us, we would be more than happy to
> > > test it extensively. Based on the results, we can then evaluate if
> > > it's the way to go. Thoughts?
> > >
> >
> > I'm happy to spin up a patch once I'm sure what exactly the tuning would
> > effect. At an initial glance I'm thinking it would be the imbalance_min
> > which is currently hardcoded to 2. But there may be something else...
>
> hrm... try to restore the old behavior by skipping task migration in
> favor of the local node if there is no imbalance.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1928,18 +1928,16 @@ static void task_numa_find_cpu(struct ta
> src_running = env->src_stats.nr_running - 1;
> dst_running = env->dst_stats.nr_running + 1;
> imbalance = max(0, dst_running - src_running);
> - imbalance = adjust_numa_imbalance(imbalance, src_running);
> + imbalance = adjust_numa_imbalance(imbalance, src_running + 1);
>
> - /* Use idle CPU if there is no imbalance */
> + /* No task migrate without imbalance */
> if (!imbalance) {
> - maymove = true;
> - if (env->dst_stats.idle_cpu >= 0) {
> - env->dst_cpu = env->dst_stats.idle_cpu;
> - task_numa_assign(env, NULL, 0);
> - return;
> - }
> + env->best_cpu = -1;
> + return;
> }
> - } else {
> + }
> +
> + do {
> long src_load, dst_load, load;
> /*
> * If the improvement from just moving env->p direction is better
> @@ -1949,7 +1947,7 @@ static void task_numa_find_cpu(struct ta
> dst_load = env->dst_stats.load + load;
> src_load = env->src_stats.load - load;
> maymove = !load_too_imbalanced(src_load, dst_load, env);
> - }
> + } while (0);
>
> for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
> /* Skip this CPU if the source task cannot migrate */
>
>
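
For readers skimming the thread: with Hillf's hunk applied, the relevant
part of task_numa_find_cpu() ends up looking roughly like the sketch
below. This is a reconstruction from the diff plus surrounding context
recalled from 5.7-rc, so treat the exact context lines as approximate.
The key change is that when the destination node has spare capacity and
no imbalance is computed, the task now simply stays on the local node
instead of being placed on an idle destination CPU; when an imbalance
remains, the load-based maymove check also runs in the spare-capacity
case.
===============================================
	if (env->dst_stats.node_type == node_has_spare) {
		unsigned int imbalance;
		int src_running, dst_running;

		/* Would movement cause an imbalance? */
		src_running = env->src_stats.nr_running - 1;
		dst_running = env->dst_stats.nr_running + 1;
		imbalance = max(0, dst_running - src_running);
		imbalance = adjust_numa_imbalance(imbalance, src_running + 1);

		/* No task migrate without imbalance: stay on the local node */
		if (!imbalance) {
			env->best_cpu = -1;
			return;
		}
	}

	/* With the patch, this also runs in the node_has_spare case */
	do {
		long src_load, dst_load, load;

		load = task_h_load(env->p);
		dst_load = env->dst_stats.load + load;
		src_load = env->src_stats.load - load;
		maymove = !load_too_imbalanced(src_load, dst_load, env);
	} while (0);
===============================================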


--
-Jirka