Re: [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing

From: Mel Gorman
Date: Fri Nov 20 2020 - 09:07:08 EST


On Fri, Nov 20, 2020 at 01:58:11PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 09:06:26AM +0000, Mel Gorman wrote:
>
> > Mel Gorman (4):
> > sched/numa: Rename nr_running and break out the magic number
> > sched: Avoid unnecessary calculation of load imbalance at clone time
> > sched/numa: Allow a floating imbalance between NUMA nodes
> > sched: Limit the amount of NUMA imbalance that can exist at fork time
> >
> > kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++-------------
> > 1 file changed, 31 insertions(+), 13 deletions(-)
>
> OK, lets give this another go :-)
>

Weeeeeeeee!

My expectation is that NAS will show some glitches with both patches 3
and 4, depending on the subtest, core usage and whether the subtest
prefers packing closely or spreading wide. I'm not *too* concerned about
that as HPC workloads are more likely to specify "places", be it OMP or
MPI. Ordinarily I would disagree with myself, as NAS has been used as one
standard for scheduler behaviour and NUMA balancing in particular, but
the series favours allowing communicating tasks to remain local while
spreading for memory bandwidth when the number of busy CPUs is higher.
I think that's a reasonable balance.

In this case the main motive for patch 4 is the "real workload" that
is both memory and CPU intensive, runs on large machines and is latency
sensitive. I'm favouring the real workload over being able to pick a NAS
configuration that would show a regression.

The main negative corner case I'm anticipating is parallel loads (like
NAS) that are not memory bound and where the degree of parallelisation is
between 25% and 100% of one node's compute capacity. Once it passes the
25% threshold, it'll start getting spread and may manifest as a
regression, with patch 3 contributing slightly and patch 4 contributing
heavily.
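
For anyone following along, this is roughly how I think of the 25%
cut-off at fork time. The helper name, the exact shift and the
standalone harness below are mine for illustration, not quoted from
the series:

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Illustrative only: tolerate an imbalance while the destination
   * node runs fewer tasks than a quarter of its CPUs; beyond that,
   * spread as usual. Names and the exact threshold are assumptions.
   */
  static bool allow_imbalance(int dst_running, int dst_weight)
  {
          /* 25% of the node's CPUs, i.e. dst_weight >> 2 */
          return dst_running < (dst_weight >> 2);
  }

  int main(void)
  {
          int node_cpus = 64;     /* e.g. one node of a large machine */

          for (int running = 0; running <= node_cpus; running += 8)
                  printf("running=%2d allow_imbalance=%d\n",
                         running, allow_imbalance(running, node_cpus));

          return 0;
  }

A parallel load sized just past a quarter of a node loses the imbalance
tolerance, which is exactly the window where I expect the corner case
above to bite.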

Communicating tasks like tbench with varying thread counts will show
minor gains and losses depending on thread count.

Single communicators like netperf or perfpipe should be ok or at least
within noise.

Hackbench should be fine because it typically saturates the machine so
any glitches there will likely be due to timing artifacts on initial
placement during clone.

Putting the predictions in writing should, by Murphy's Law, summon the
regression demons faster to prove me wrong :P

Thanks.

--
Mel Gorman
SUSE Labs