Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Mel Gorman
Date: Fri Mar 20 2020 - 12:38:49 EST


On Fri, Mar 20, 2020 at 04:30:08PM +0100, Jirka Hladky wrote:
> >
> > MPI or OMP and what is a low thread count? For MPI at least, I saw a 0.4%
> > gain on a 4-node machine for bt_C and a 3.88% regression on 8 nodes. I
> > think it must be OMP you are using because I found I had to disable UA
> > for MPI at some point in the past for reasons I no longer remember.
>
>
> Yes, it's indeed OMP. By low thread count, I mean up to 2x the number of
> NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA node
> servers).
>

Ok, so we know it's within the imbalance threshold where a NUMA node can
be left idle.

> > One possibility would be to spread wide always at clone time and assume
> > wake_affine will pull related tasks but it's fragile because it breaks
> > if the cloned task execs and then allocates memory from a remote node
> > only to migrate to a local node immediately.
>
>
> I think the only way to find out how it performs is to test it. If you
> could prepare a patch like that, I'm more than happy to give it a try!
>

When the initial spreading was prevented, it was mainly for pipelines --
even basic shell scripts. In that case it was observed that a shell would
fork/exec two tasks connected via a pipe that started on separate nodes and
had allocated remote data before being pulled close. The processes were
typically too short-lived for NUMA balancing to fix it up, and by exec time
the information on where the fork happened was lost. See 2c83362734da
("sched/fair: Consider SD_NUMA when selecting the most idle group to
schedule on"). The logic has probably been partially broken since then
because of how SD_NUMA is now treated, but the concern about spreading
wide prematurely remains.
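
For illustration only, the pipeline pattern in question looks roughly like
the userspace sketch below. It assumes a POSIX system, and run_pipeline()
is a hypothetical helper name that is not part of the series or the kernel.
It only shows the fork/exec/pipe shape a shell produces; where each child
initially runs is decided by the scheduler at fork time, not by this code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run "producer | consumer" the way a shell would.
 * Returns the consumer's exit status, or -1 on error.
 */
int run_pipeline(char *const producer[], char *const consumer[])
{
	int fds[2];
	int status = 0;
	pid_t p1, p2;

	if (pipe(fds) < 0)
		return -1;

	/* First fork: the scheduler picks the initial CPU/node here. */
	p1 = fork();
	if (p1 < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (p1 == 0) {
		/* Producer writes into the pipe. */
		dup2(fds[1], STDOUT_FILENO);
		close(fds[0]);
		close(fds[1]);
		execvp(producer[0], producer);
		_exit(127);	/* exec failed */
	}

	/* Second fork: may be placed on a different node than the first. */
	p2 = fork();
	if (p2 < 0) {
		close(fds[0]);
		close(fds[1]);
		waitpid(p1, NULL, 0);
		return -1;
	}
	if (p2 == 0) {
		/* Consumer reads from the pipe. */
		dup2(fds[0], STDIN_FILENO);
		close(fds[0]);
		close(fds[1]);
		execvp(consumer[0], consumer);
		_exit(127);	/* exec failed */
	}

	/* Parent keeps no pipe ends open so the consumer sees EOF. */
	close(fds[0]);
	close(fds[1]);

	waitpid(p1, NULL, 0);
	if (waitpid(p2, &status, 0) < 0)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Both children here are short-lived and start exchanging data immediately,
which is the case where an initial placement on separate nodes is costly
and there is little time for it to be corrected afterwards.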

--
Mel Gorman
SUSE Labs