Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Jirka Hladky
Date: Thu May 07 2020 - 12:30:03 EST


Hi Mel,

We are not targeting just OMP applications. We also see the performance
degradation for other workloads, like SPECjbb2005 and SPECjvm2008. Even
worse, it also affects higher thread counts. For example, comparing
5.7.0-0.rc2 against the 5.6 kernel on a 4 NUMA node server with 2x AMD
7351 CPUs, we see a 22% performance degradation for 32 threads (the
system has 64 CPUs in total). We observe this degradation only when we
run a single SPECjbb binary. When running 4 SPECjbb binaries in
parallel, there is no change in performance between 5.6 and 5.7.

That's why we are asking for a kernel tunable, which we would add to
a tuned profile. We don't expect users to change it frequently but
rather to set the performance profile once based on the purpose of the
server.

If you could prepare a patch for us, we would be more than happy to
test it extensively. Based on the results, we can then evaluate if
it's the way to go. Thoughts?

Thanks a lot!
Jirka

On Thu, May 7, 2020 at 5:54 PM Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> > Hi Mel,
> >
> > > > Yes, it's indeed OMP. By low thread counts, I mean up to 2x the number
> > > > of NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA
> > > > node servers).
> > >
> > > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > > be left idle.
> >
> > Today my colleagues and I discussed the performance drop for some
> > workloads at low thread counts (roughly up to 2x the number of NUMA
> > nodes). We are worried that it can be a severe issue for some use
> > cases, which require full memory bandwidth even when only some of the
> > CPUs are used.
> >
> > We understand that the scheduler cannot distinguish this type of
> > workload from others automatically. However, there was an idea for a
> > *new kernel tunable to control the imbalance threshold*. Based on the
> > purpose of the server, users could set this tunable. See the tuned
> > project, which allows creating performance profiles [1].
> >
>
> I'm not completely opposed to it but given that the setting is global,
> I imagine it could have other consequences if two applications run
> at different times have different requirements. Given that it's OMP,
> I would have imagined that an application that really cared about this
> would specify what was needed using OMP_PLACES. Why would someone prefer
> kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
> specific knowledge of the application even to know that a particular
> tuned profile is needed.
>
> --
> Mel Gorman
> SUSE Labs
>


--
-Jirka