Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

From: Mel Gorman
Date: Thu May 14 2020 - 05:51:00 EST


On Wed, May 13, 2020 at 06:20:53PM +0200, Jirka Hladky wrote:
> Thank you, Mel!
>
> I think I have to make sure we cover the scenario you have targeted
> when developing adjust_numa_imbalance:
>
> =======================================================================
> https://github.com/torvalds/linux/blob/4f8a3cc1183c442daee6cc65360e3385021131e4/kernel/sched/fair.c#L8910
>
> /*
> * Allow a small imbalance based on a simple pair of communicating
> * tasks that remain local when the source domain is almost idle.
> */
> =======================================================================
>
> Could you point me to a benchmark for this scenario? I have checked
> https://github.com/gormanm/mmtests
> and we use lots of the same benchmarks but I'm not sure if we cover
> this particular scenario.
>

The NUMA imbalance part showed up as part of the general effort to
reconcile NUMA balancing with load balancing. It has been known for years
that the two balancers disagreed to the extent that NUMA balancing would
retry migrations multiple times just to keep things local, leading to
excessive migrations. The full battery of tests I used when trying to
reconcile the balancers, and later when working on Vincent's version, is
as follows:

scheduler-unbound
scheduler-forkintensive
scheduler-perfpipe
scheduler-perfpipe-cpufreq
scheduler-schbench
db-pgbench-timed-ro-small-xfs
hpc-nas-c-class-mpi-full-xfs
hpc-nas-c-class-mpi-half-xfs
hpc-nas-c-class-omp-full
hpc-nas-c-class-omp-half
hpc-nas-d-class-mpi-full-xfs
hpc-nas-d-class-mpi-half-xfs
hpc-nas-d-class-omp-full
hpc-nas-d-class-omp-half
io-dbench4-async-ext4
io-dbench4-async-xfs
jvm-specjbb2005-multi
jvm-specjbb2005-single
network-netperf-cstate
network-netperf-rr-cstate
network-netperf-rr-unbound
network-netperf-unbound
network-tbench
numa-autonumabench
workload-kerndevel-xfs
workload-shellscripts-xfs

Where there is -ext4 or -xfs, just remove the filesystem suffix to get
the base configuration, e.g. the base configuration for
io-dbench4-async-ext4 is io-dbench4-async. Both filesystems are sometimes
tested because they interact differently with the scheduler: ext4 uses a
journal thread while xfs uses workqueues.
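
For reference, running one of these configurations with mmtests looks
roughly like the following. Treat it as a sketch: the config file layout
and the comparison helpers can differ between mmtests releases, so check
the README in the tree you clone.

  # Fetch mmtests and run the unbound netperf configuration. The run
  # name ("baseline" here) tags the results under work/log so that a
  # second run on a patched kernel can be compared against it later,
  # e.g. with bin/compare-mmtests.pl or compare-kernels.sh.
  git clone https://github.com/gormanm/mmtests.git
  cd mmtests
  ./run-mmtests.sh --config configs/config-network-netperf-unbound baseline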

The imbalance case is most obvious with network-netperf-unbound running
on localhost. If the client and server end up on separate nodes, it's
obvious from mpstat that two nodes are busy and that the tasks are
migrating quite a bit. The second effect is that NUMA balancing becomes
active, trapping hinting faults and migrating pages.
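
If you want to eyeball the effect without the full harness, something
like the following is a rough manual approximation. It assumes netperf
is installed and the machine has at least two NUMA nodes; the mmtests
configuration sweeps more message sizes than this single invocation.

  # Unbound netperf over loopback: server and client are free to be
  # placed on any CPU, so the load balancer decides whether they share
  # a node or get split across two.
  netserver
  netperf -t TCP_STREAM -H 127.0.0.1 -l 60 &

  # Per-CPU utilisation shows whether the pair ended up on CPUs
  # belonging to different nodes.
  mpstat -P ALL 1

  # NUMA balancing activity: hinting faults trapped and pages migrated.
  grep -E 'numa_hint_faults|numa_pages_migrated|pgmigrate' /proc/vmstat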

The biggest problem I have right now is that the wakeup path between tasks
that are local is slower than doing a remote wakeup via wake_list that
potentially sends an IPI, which is ridiculous. The slower wakeup manifests
as a loss of throughput for netperf even though all the accesses are
local. At least that's what I'm looking at whenever I get the chance.
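
For anyone who wants to poke at the same wakeup behaviour, the scheduler
tracepoints give a rough view. The invocation below is only a sketch,
with netperf TCP_RR over loopback as one convenient way to generate a
steady stream of wakeups.

  # Record scheduler events while a request/response load runs locally
  # (netserver from the earlier snippet must already be running).
  perf sched record -- netperf -t TCP_RR -H 127.0.0.1 -l 30

  # Summarise per-task wakeup and scheduling latencies.
  perf sched latency

  # Dump the raw wakeup events to see which CPU performed each wakeup.
  perf script | grep sched_wakeup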

--
Mel Gorman
SUSE Labs