[RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer

From: Mel Gorman
Date: Wed Feb 12 2020 - 04:37:00 EST


The NUMA balancer makes task placement decisions that only partially
take the load balancer into account, and vice versa, and there are
inconsistencies. This can result in placement decisions that override
each other, leading to unnecessary migrations -- of both tasks and
pages. This is a prototype series that attempts to reconcile those
decisions. It is a bit premature and it would also need to be reconciled
with Vincent's series "[PATCH 0/4] remove runnable_load_avg and improve
group_classify"

The first three patches are unrelated and are either pending in tip or
should be, but they were part of the testing of this series so I have
to mention them.

The fourth and fifth patches are tracing only and were needed to get
sensible data out of ftrace with respect to task placement for NUMA
balancing. Patches 6-8 reduce overhead and reduce the chances of NUMA
balancing overriding itself. Patches 9-11 try to bring the CPU placement
decisions of NUMA balancing in line with the load balancer.
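
To illustrate the direction of patches 9-11, here is a minimal
user-space sketch of the kind of check involved. This is not kernel
code; struct rq_model and numa_migrate_allowed() are hypothetical
names. The idea is that NUMA balancing should only move a task to its
preferred node if the destination CPU is idle or would not end up
busier than the source, because otherwise the load balancer would be
expected to undo the placement.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of a per-CPU runqueue: just a running-task count. */
struct rq_model {
        int nr_running;
};

/*
 * Only migrate for NUMA locality if the destination is idle or would
 * not end up with more runnable tasks than the source after the move.
 * A busier destination is exactly the imbalance the load balancer
 * corrects, so the task would likely be pulled straight back.
 */
static bool numa_migrate_allowed(const struct rq_model *src,
                                 const struct rq_model *dst)
{
        if (dst->nr_running == 0)
                return true;

        return dst->nr_running + 1 <= src->nr_running - 1;
}

int main(void)
{
        struct rq_model src  = { .nr_running = 3 };
        struct rq_model idle = { .nr_running = 0 };
        struct rq_model busy = { .nr_running = 3 };

        printf("move to idle CPU allowed: %d\n",
               numa_migrate_allowed(&src, &idle));
        printf("move to busy CPU allowed: %d\n",
               numa_migrate_allowed(&src, &busy));
        return 0;
}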

In terms of Vincent's patches, I have not checked but I expect conflicts
to be with patches 10 and 11.

Note that this is not necessarily a universal performance win although
performance results are generally ok (small gains/losses depending on
the machine and workload). However, task migrations, page migrations,
variability and overall overhead are generally reduced.

Tests are still running and take quite a long time so I do not have a
full picture. The main reference workload I used was specjbb running one
JVM per node, which typically would be expected to split evenly. It's
an interesting workload because the number of "warehouses" is not
linearly related to the number of running tasks due to the creation of
GC threads and other interfering activity. The mmtests configuration used
is jvm-specjbb2005-multi with two runs -- one of them with ftrace enabling
the relevant scheduler tracepoints.

The baseline is taken from late in the 5.6 merge window plus patches 1-4
to take into account patches that are already in flight and the tracing
patch I relied on for analysis.

The headline performance of the series looks like

                     baseline-v1                lboverload-v1
Hmean     tput-1        37842.47 (     0.00%)        42391.63 *    12.02%*
Hmean     tput-2        94225.00 (     0.00%)        91937.32 (    -2.43%)
Hmean     tput-3       141855.04 (     0.00%)       142100.59 (     0.17%)
Hmean     tput-4       186799.96 (     0.00%)       184338.10 (    -1.32%)
Hmean     tput-5       229918.54 (     0.00%)       230894.68 (     0.42%)
Hmean     tput-6       271006.38 (     0.00%)       271367.35 (     0.13%)
Hmean     tput-7       312279.37 (     0.00%)       314141.97 (     0.60%)
Hmean     tput-8       354916.09 (     0.00%)       357029.57 (     0.60%)
Hmean     tput-9       397299.92 (     0.00%)       399832.32 (     0.64%)
Hmean     tput-10      438169.79 (     0.00%)       442954.02 (     1.09%)
Hmean     tput-11      476864.31 (     0.00%)       484322.15 (     1.56%)
Hmean     tput-12      512327.04 (     0.00%)       519117.29 (     1.33%)
Hmean     tput-13      528983.50 (     0.00%)       530772.34 (     0.34%)
Hmean     tput-14      537757.24 (     0.00%)       538390.58 (     0.12%)
Hmean     tput-15      535328.60 (     0.00%)       539402.88 (     0.76%)
Hmean     tput-16      539356.59 (     0.00%)       545617.63 (     1.16%)
Hmean     tput-17      535370.94 (     0.00%)       547217.95 (     2.21%)
Hmean     tput-18      540510.94 (     0.00%)       548145.71 (     1.41%)
Hmean     tput-19      536737.76 (     0.00%)       545281.39 (     1.59%)
Hmean     tput-20      537509.85 (     0.00%)       543759.71 (     1.16%)
Hmean     tput-21      534632.44 (     0.00%)       544848.03 (     1.91%)
Hmean     tput-22      531538.29 (     0.00%)       540987.41 (     1.78%)
Hmean     tput-23      523364.37 (     0.00%)       536640.28 (     2.54%)
Hmean     tput-24      530613.55 (     0.00%)       531431.12 (     0.15%)
Stddev    tput-1         1569.78 (     0.00%)          674.58 (    57.03%)
Stddev    tput-2            8.49 (     0.00%)         1368.25 (-16025.00%)
Stddev    tput-3         4125.26 (     0.00%)         1120.06 (    72.85%)
Stddev    tput-4         4677.51 (     0.00%)          717.71 (    84.66%)
Stddev    tput-5         3387.75 (     0.00%)         1774.13 (    47.63%)
Stddev    tput-6         1400.07 (     0.00%)         1079.75 (    22.88%)
Stddev    tput-7         4374.16 (     0.00%)         2571.75 (    41.21%)
Stddev    tput-8         2370.22 (     0.00%)         2918.23 (   -23.12%)
Stddev    tput-9         3893.33 (     0.00%)         2708.93 (    30.42%)
Stddev    tput-10        6260.02 (     0.00%)         3935.05 (    37.14%)
Stddev    tput-11        3989.50 (     0.00%)         6443.16 (   -61.50%)
Stddev    tput-12         685.19 (     0.00%)        12999.45 ( -1797.21%)
Stddev    tput-13        3251.98 (     0.00%)         9311.18 (  -186.32%)
Stddev    tput-14        2793.78 (     0.00%)         6175.87 (  -121.06%)
Stddev    tput-15        6777.62 (     0.00%)        25942.33 (  -282.76%)
Stddev    tput-16       25057.04 (     0.00%)         4227.08 (    83.13%)
Stddev    tput-17       22336.80 (     0.00%)        16890.66 (    24.38%)
Stddev    tput-18        6662.36 (     0.00%)         3015.10 (    54.74%)
Stddev    tput-19       20395.79 (     0.00%)         1098.14 (    94.62%)
Stddev    tput-20       17140.27 (     0.00%)         9019.15 (    47.38%)
Stddev    tput-21        5176.73 (     0.00%)         4300.62 (    16.92%)
Stddev    tput-22       28279.32 (     0.00%)         6544.98 (    76.86%)
Stddev    tput-23       25368.87 (     0.00%)         3621.09 (    85.73%)
Stddev    tput-24        3082.28 (     0.00%)         2500.33 (    18.88%)

Generally, this is showing a small gain in performance but it is
borderline noise. However, in most cases the variability between
the JVMs' performance is much reduced, except at the point where
a node is almost fully utilised.
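
As a reading aid, here is a minimal sketch of how such a summary can be
produced from per-JVM throughput samples, assuming Hmean is the
harmonic mean and Stddev the sample standard deviation; the sample
values below are made up.

/* build: cc -o summarise summarise.c -lm */
#include <math.h>
#include <stdio.h>

/* Harmonic mean of n positive throughput samples. */
static double hmean(const double *v, int n)
{
        double sum = 0.0;
        for (int i = 0; i < n; i++)
                sum += 1.0 / v[i];
        return n / sum;
}

/* Sample standard deviation of n samples (n >= 2). */
static double stddev(const double *v, int n)
{
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++)
                mean += v[i] / n;
        for (int i = 0; i < n; i++)
                var += (v[i] - mean) * (v[i] - mean) / (n - 1);
        return sqrt(var);
}

int main(void)
{
        /* Hypothetical per-JVM throughput for one warehouse count. */
        double tput[] = { 37000.0, 38200.0, 38400.0 };
        int n = sizeof(tput) / sizeof(tput[0]);

        printf("Hmean  %10.2f\n", hmean(tput, n));
        printf("Stddev %10.2f\n", stddev(tput, n));
        return 0;
}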

The high-level NUMA stats from /proc/vmstat look like this

                                 baseline-v1   lboverload-v1
NUMA base-page range updates      1710927.00      2199691.00
NUMA PTE updates                   871759.00      1060491.00
NUMA PMD updates                     1639.00         2225.00
NUMA hint faults                   772179.00       967165.00
NUMA hint local faults %           647558.00       845357.00
NUMA hint local percent                83.86           87.41
NUMA pages migrated                 64920.00        45254.00
AutoNUMA cost                        3874.10         4852.08

The percentage of local hits is higher (87.41% vs 83.86%) and the
number of pages migrated is reduced by 30%. The downside is that
there are spikes where scanning is higher because, in some cases,
NUMA balancing will not move a task to a local node if the CPU load
balancer would immediately override it. It is not straightforward to
fix this in a universal way and it should be a separate series.
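
As a sanity check, the derived figures follow directly from the raw
counters: the local percentage is hint local faults divided by hint
faults, and the migration reduction compares the two "pages migrated"
counts. A quick throwaway calculation (values copied from the table
above):

#include <stdio.h>

int main(void)
{
        /* { baseline, lboverload-v1 } values from the vmstat summary. */
        double hint_faults[2]  = { 772179.0, 967165.0 };
        double local_faults[2] = { 647558.0, 845357.0 };
        double migrated[2]     = { 64920.0, 45254.0 };

        for (int i = 0; i < 2; i++)
                printf("local percent: %.2f%%\n",
                       100.0 * local_faults[i] / hint_faults[i]);

        printf("page migration reduction: %.0f%%\n",
               100.0 * (migrated[0] - migrated[1]) / migrated[0]);
        return 0;
}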

A separate run gathered information from ftrace and analysed it
offline.

                                           5.5.0          5.5.0
                                        baseline  lboverload-v1
Migrate failed no CPU                    1934.00        4999.00
Migrate failed move to idle                 0.00           0.00
Migrate failed swap task fail             981.00        2810.00
Task Migrated swapped                    6765.00       12609.00
Task Migrated swapped local NID             0.00           0.00
Task Migrated swapped within group        644.00        1105.00
Task Migrated idle CPU                  14776.00         750.00
Task Migrated idle CPU local NID            0.00           0.00
Task Migrate retry                       2521.00        7564.00
Task Migrate retry success                  0.00           0.00
Task Migrate retry failed                2521.00        7564.00
Load Balance cross NUMA               1222195.00     1223454.00

"Migrate failed no CPU" is the times when NUMA balancing did not
find a suitable page on a preferred node. This is increased because
the series avoids making decisions that the LB would override.

"Migrate failed swap task fail" is when migrate_swap fails and it
can fail for a lot of reasons.

"Task Migrated swapped" is also higher but this is somewhat positive.
It is when two tasks are swapped to keep load neutral or improved
from the perspective of the load balancer. The series attempts to
swap tasks that both move to their preferred node for example.
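
To illustrate why such a swap is attractive, here is a hypothetical
user-space model (not the kernel's migrate_swap() implementation):
swapping two tasks leaves the number of runnable tasks on each node
unchanged, so it is load neutral, and it is worthwhile when it
increases the number of tasks running on their preferred node.

#include <stdio.h>

/* Hypothetical model: a task has a current node and a preferred node. */
struct task_model {
        int cur_node;
        int pref_node;
};

/*
 * A swap keeps the per-node task counts identical, so it is load
 * neutral from the load balancer's point of view.  It is worthwhile
 * when more tasks end up on their preferred node than before.
 */
static int swap_improves_locality(const struct task_model *a,
                                  const struct task_model *b)
{
        int before = (a->cur_node == a->pref_node) +
                     (b->cur_node == b->pref_node);
        int after  = (b->cur_node == a->pref_node) +
                     (a->cur_node == b->pref_node);

        return after > before;
}

int main(void)
{
        /* Task a runs on node 1 but prefers node 0; b is the opposite. */
        struct task_model a = { .cur_node = 1, .pref_node = 0 };
        struct task_model b = { .cur_node = 0, .pref_node = 1 };

        printf("swap worthwhile: %s\n",
               swap_improves_locality(&a, &b) ? "yes" : "no");
        return 0;
}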

"Task Migrated idle CPU" is also reduced. Again, this is a reflection
that the series is trying to avoid NUMA Balancer and LB fighting
each other.

"Task Migrate retry failed" happens when NUMA balancing makes multiple
attempts to place a task on a preferred node.

So broadly speaking, similar or better performance with fewer page
migrations and less conflict between the two balancers for at least one
workload and one machine. There is room for improvement and I need data
on more workloads and machines but an early review would be nice.

 include/trace/events/sched.h |  51 +++--
 kernel/sched/core.c          |  11 --
 kernel/sched/fair.c          | 430 ++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h         |  13 ++
 4 files changed, 379 insertions(+), 126 deletions(-)

--
2.16.4