Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Jirka Hladky
Date: Fri Mar 20 2020 - 11:34:01 EST


> MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> gain on a 4-node machine for bt_C and a 3.88% regression on 8 nodes. I
> think it must be OMP you are using because I found I had to disable UA
> for MPI at some point in the past for reasons I no longer remember.

Yes, it's indeed OMP. By a low thread count, I mean up to 2x the number
of NUMA nodes (8 threads on 4-NUMA-node servers, 16 threads on 8-NUMA-node
servers).

> One possibility would be to spread wide always at clone time and assume
> wake_affine will pull related tasks but it's fragile because it breaks
> if the cloned task execs and then allocates memory from a remote node
> only to migrate to a local node immediately.

I think the only way to find out how it performs is to test it. If you
could prepare a patch like that, I'd be more than happy to give it a try!
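
Just to make sure we are talking about the same idea, here is a very rough
sketch of how I read "spread wide always at clone time" -- purely
illustrative, not a tested patch; the function name is made up, and the real
change would presumably live in the SD_BALANCE_FORK path of
select_task_rq_fair():

/*
 * Rough sketch only: at clone time, ignore locality and place the child
 * on the least loaded allowed CPU anywhere on the machine, so that tasks
 * spread across NUMA nodes immediately. wake_affine() is then expected
 * to pull communicating tasks back together at wakeup time.
 */
static int select_fork_cpu_spread(struct task_struct *p, int prev_cpu)
{
	unsigned long min_load = ULONG_MAX;
	int cpu, best_cpu = prev_cpu;

	/* Racy, lockless scan; h_nr_running is only a crude load proxy. */
	for_each_cpu_and(cpu, cpu_online_mask, p->cpus_ptr) {
		unsigned long load = cpu_rq(cpu)->cfs.h_nr_running;

		if (load < min_load) {
			min_load = load;
			best_cpu = cpu;
		}
	}

	return best_cpu;
}

A real patch would of course reuse the existing find_idlest_cpu() machinery
rather than a raw scan like this, but if the above captures the intent, I can
test whatever version you prefer.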

Jirka


On Fri, Mar 20, 2020 at 4:22 PM Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Mar 20, 2020 at 03:37:44PM +0100, Jirka Hladky wrote:
> > Hi Mel,
> >
> > just a quick update. I have increased the testing coverage, and other tests
> > from the NAS suite show a big performance drop for low thread counts as
> > well:
> >
> > sp_C_x - still shows the biggest drop, up to 50%
> > bt_C_x - performance drop of up to 40%
> > ua_C_x - performance drop of up to 30%
> >
>
> MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> gain on a 4-node machine for bt_C and a 3.88% regression on 8 nodes. I
> think it must be OMP you are using because I found I had to disable UA
> for MPI at some point in the past for reasons I no longer remember.
>
> > My point is that the performance drop for low thread counts is more
> > common than we initially thought.
> >
> > Let me know if you need more data.
> >
>
> I just wanted a clarification on the thread count and a confirmation that
> it's OMP. For MPI, I did note that some of the other NAS kernels showed a
> slight dip, but it was nowhere near as severe as SP, and the problem was the
> same as before -- two or more tasks stayed on the same node without
> spreading out because there was no pressure to do so. There was enough CPU
> and memory capacity, with no obvious pattern that could be used to spread
> the load wide early.
>
> One possibility would be to spread wide always at clone time and assume
> wake_affine will pull related tasks but it's fragile because it breaks
> if the cloned task execs and then allocates memory from a remote node
> only to migrate to a local node immediately.
>
> --
> Mel Gorman
> SUSE Labs
>


--
-Jirka