Re: [PATCH v4 00/10] sched/fair: rework the CFS load balance

From: Phil Auld
Date: Thu Oct 31 2019 - 09:57:30 EST


Hi Vincent,

On Wed, Oct 30, 2019 at 06:25:49PM +0100 Vincent Guittot wrote:
> On Wed, 30 Oct 2019 at 15:39, Phil Auld <pauld@xxxxxxxxxx> wrote:
> > > That fact that the 4 nodes works well but not the 8 nodes is a bit
> > > surprising except if this means more NUMA level in the sched_domain
> > > topology
> > > Could you give us more details about the sched domain topology ?
> > >
> >
> > The 8-node system has 5 sched domain levels. The 4-node system only
> > has 3.
>
> That's an interesting difference. and your additional tests on a 8
> nodes with 3 level tends to confirm that the number of level make a
> difference
> I need to study a bit more how this can impact the spread of tasks

So I think I understand what my numbers have been showing.

I believe the numa balancing is causing problems.

Here are numbers from the test on 5.4-rc3+ without your series:

echo 1 > /proc/sys/kernel/numa_balancing
lu.C.x_156_GROUP_1 Average 10.87 0.00 0.00 11.49 36.69 34.26 30.59 32.10
lu.C.x_156_GROUP_2 Average 20.15 16.32 9.49 24.91 21.07 20.93 21.63 21.50
lu.C.x_156_GROUP_3 Average 21.27 17.23 11.84 21.80 20.91 20.68 21.11 21.16
lu.C.x_156_GROUP_4 Average 19.44 6.53 8.71 19.72 22.95 23.16 28.85 26.64
lu.C.x_156_GROUP_5 Average 20.59 6.20 11.32 14.63 28.73 30.36 22.20 21.98
lu.C.x_156_NORMAL_1 Average 20.50 19.95 20.40 20.45 18.75 19.35 18.25 18.35
lu.C.x_156_NORMAL_2 Average 17.15 19.04 18.42 18.69 21.35 21.42 20.00 19.92
lu.C.x_156_NORMAL_3 Average 18.00 18.15 17.55 17.60 18.90 18.40 19.90 19.75
lu.C.x_156_NORMAL_4 Average 20.53 20.05 20.21 19.11 19.00 19.47 19.37 18.26
lu.C.x_156_NORMAL_5 Average 18.72 18.78 19.72 18.50 19.67 19.72 21.11 19.78

============156_GROUP========Mop/s===================================
min q1 median q3 max
1564.63 3003.87 3928.23 5411.13 8386.66
============156_GROUP========time====================================
min q1 median q3 max
243.12 376.82 519.06 678.79 1303.18
============156_NORMAL========Mop/s===================================
min q1 median q3 max
13845.6 18013.8 18545.5 19359.9 19647.4
============156_NORMAL========time====================================
min q1 median q3 max
103.78 105.32 109.95 113.19 147.27

(The run above is an especially bad example... we don't usually see 0.00s, but
overall it's basically on par with what we typically get. The unevenness shows up
in the wide spread of the GROUP results.)


echo 0 > /proc/sys/kernel/numa_balancing
lu.C.x_156_GROUP_1 Average 17.75 19.30 21.20 21.20 20.20 20.80 18.90 16.65
lu.C.x_156_GROUP_2 Average 18.38 19.25 21.00 20.06 20.19 20.31 19.56 17.25
lu.C.x_156_GROUP_3 Average 21.81 21.00 18.38 16.86 20.81 21.48 18.24 17.43
lu.C.x_156_GROUP_4 Average 20.48 20.96 19.61 17.61 17.57 19.74 18.48 21.57
lu.C.x_156_GROUP_5 Average 23.32 21.96 19.16 14.28 21.44 22.56 17.00 16.28
lu.C.x_156_NORMAL_1 Average 19.50 19.83 19.58 19.25 19.58 19.42 19.42 19.42
lu.C.x_156_NORMAL_2 Average 18.90 18.40 20.00 19.80 19.70 19.30 19.80 20.10
lu.C.x_156_NORMAL_3 Average 19.45 19.09 19.91 20.09 19.45 18.73 19.45 19.82
lu.C.x_156_NORMAL_4 Average 19.64 19.27 19.64 19.00 19.82 19.55 19.73 19.36
lu.C.x_156_NORMAL_5 Average 18.75 19.42 20.08 19.67 18.75 19.50 19.92 19.92

============156_GROUP========Mop/s===================================
min q1 median q3 max
14956.3 16346.5 17505.7 18440.6 22492.7
============156_GROUP========time====================================
min q1 median q3 max
90.65 110.57 116.48 124.74 136.33
============156_NORMAL========Mop/s===================================
min q1 median q3 max
29801.3 30739.2 31967.5 32151.3 34036
============156_NORMAL========time====================================
min q1 median q3 max
59.91 63.42 63.78 66.33 68.42


Note that there is already a significant improvement just from disabling numa
balancing. We are still seeing imbalance due to the use of weighted load and
averages, but in this case the GROUP runs come in at roughly 55% of the NORMAL
throughput rather than ~5x slower. The overall performance of the benchmark is
also much better in both the GROUP and NORMAL cases.
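
Roughly, from the Mop/s medians above:

    numa_balancing=1:  GROUP 3928.23 vs NORMAL 18545.5   ~= 0.21  (the ~5x)
    numa_balancing=0:  GROUP 17505.7 vs NORMAL 31967.5   ~= 0.55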



Here's the same test, same system with the full series (lb_v4a as I've been calling it):

echo 1 > /proc/sys/kernel/numa_balancing
lu.C.x_156_GROUP_1 Average 18.59 19.36 19.50 18.86 20.41 20.59 18.27 20.41
lu.C.x_156_GROUP_2 Average 19.52 20.52 20.48 21.17 19.52 19.09 17.70 18.00
lu.C.x_156_GROUP_3 Average 20.58 20.71 20.17 20.50 18.46 19.50 18.58 17.50
lu.C.x_156_GROUP_4 Average 18.95 19.63 19.47 19.84 18.79 19.84 20.84 18.63
lu.C.x_156_GROUP_5 Average 16.85 17.96 19.89 19.15 19.26 20.48 21.70 20.70
lu.C.x_156_NORMAL_1 Average 18.04 18.48 20.00 19.72 20.72 20.48 18.48 20.08
lu.C.x_156_NORMAL_2 Average 18.22 20.56 19.50 19.39 20.67 19.83 18.44 19.39
lu.C.x_156_NORMAL_3 Average 17.72 19.61 19.56 19.17 20.17 19.89 20.78 19.11
lu.C.x_156_NORMAL_4 Average 18.05 19.74 20.21 19.89 20.32 20.26 19.16 18.37
lu.C.x_156_NORMAL_5 Average 18.89 19.95 20.21 20.63 19.84 19.26 19.26 17.95

============156_GROUP========Mop/s===================================
min q1 median q3 max
13460.1 14949 15851.7 16391.4 18993
============156_GROUP========time====================================
min q1 median q3 max
107.35 124.39 128.63 136.4 151.48
============156_NORMAL========Mop/s===================================
min q1 median q3 max
14418.5 18512.4 19049.5 19682 19808.8
============156_NORMAL========time====================================
min q1 median q3 max
102.93 103.6 107.04 110.14 141.42


echo 0 > /proc/sys/kernel/numa_balancing
lu.C.x_156_GROUP_1 Average 19.00 19.33 19.33 19.58 20.08 19.67 19.83 19.17
lu.C.x_156_GROUP_2 Average 18.55 19.91 20.09 19.27 18.82 19.27 19.91 20.18
lu.C.x_156_GROUP_3 Average 18.42 19.08 19.75 19.00 19.50 20.08 20.25 19.92
lu.C.x_156_GROUP_4 Average 18.42 19.83 19.17 19.50 19.58 19.83 19.83 19.83
lu.C.x_156_GROUP_5 Average 19.17 19.42 20.17 19.92 19.25 18.58 19.92 19.58
lu.C.x_156_NORMAL_1 Average 19.25 19.50 19.92 18.92 19.33 19.75 19.58 19.75
lu.C.x_156_NORMAL_2 Average 19.42 19.25 17.83 18.17 19.83 20.50 20.42 20.58
lu.C.x_156_NORMAL_3 Average 18.58 19.33 19.75 18.25 19.42 20.25 20.08 20.33
lu.C.x_156_NORMAL_4 Average 19.00 19.55 19.73 18.73 19.55 20.00 19.64 19.82
lu.C.x_156_NORMAL_5 Average 19.25 19.25 19.50 18.75 19.92 19.58 19.92 19.83

============156_GROUP========Mop/s===================================
min q1 median q3 max
28520.1 29024.2 29042.1 29367.4 31235.2
============156_GROUP========time====================================
min q1 median q3 max
65.28 69.43 70.21 70.25 71.49
============156_NORMAL========Mop/s===================================
min q1 median q3 max
28974.5 29806.5 30237.1 30907.4 31830.1
============156_NORMAL========time====================================
min q1 median q3 max
64.06 65.97 67.43 68.41 70.37


This all now makes sense. Looking at the numa balancing code a bit, you can see
that it still uses load for its placement decisions, so it is still subject to
making bogus decisions based on the weighted load. In this case it has been
actively working against the load balancer because of that.
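
For concreteness, this is roughly the shape of the check I mean, going from
memory of task_numa_find_cpu() in kernel/sched/fair.c around v5.4 (a sketch,
not a verbatim quote, so details may differ):

    /* Sketch of the load-based part of NUMA balancing task placement (~v5.4) */
    load = task_h_load(env->p);              /* weighted, group-scaled load */
    dst_load = env->dst_stats.load + load;
    src_load = env->src_stats.load - load;

    /* the candidate move is gated on these weighted-load sums */
    maymove = !load_too_imbalanced(src_load, dst_load, env);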

I think that with the three NUMA levels on this system, the numa balancing was
able to win that fight more often. We don't see this effect to the same degree
on systems with only one SD_NUMA level.

Following the other part of this thread, I have to add that I'm of the opinion
that the weighted load (which I believe is all we have now) really should be used
only in extreme cases of overload, to deal with fairness. And even then maybe not.
As far as I can see, once fair group scheduling is involved, that load is
basically a random number between 1 and 1024. It has no real bearing on how
much "load" a task will put on a cpu, and any comparison of it to cpu capacity
is pretty meaningless.
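
As a concrete (made-up) illustration of that "random number" point, take a
fully CPU-bound task inside a cgroup with the default cpu.shares of 1024 and
runnable work spread over 16 CPUs. Very roughly following the task_h_load()
scaling, and with all numbers being illustrative assumptions, you get
something like:

    /* Toy arithmetic, not kernel code: how the cgroup hierarchy scales the
     * weighted load the balancer sees.  All values are illustrative. */
    unsigned long task_load_avg   = 1024;  /* task alone is fully CPU-bound      */
    unsigned long group_shares    = 1024;  /* default cpu.shares for the cgroup  */
    unsigned long busy_cpus       = 16;    /* group has runnable work on 16 CPUs */

    /* the group entity weight seen on one CPU is roughly shares / busy_cpus */
    unsigned long group_se_load   = group_shares / busy_cpus;              /* ~64 */

    /* task_h_load() scales the task's load_avg by the group's per-CPU share */
    unsigned long task_h_load_est = task_load_avg * group_se_load / 1024;  /* ~64 */

The same task outside any cgroup reports ~1024. Both demand a full CPU, so
comparing these numbers to each other, or to cpu capacity, tells the balancer
almost nothing about actual demand.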

I'm sure there are workloads for which the numa balancing is more important. But
even then I suspect it is making the wrong decisions more often than not. I think
a similar rework may be needed :)

I've asked our perf team to try the full battery of tests with numa balancing
disabled to see what it shows across the board.


Good job on this, and thanks for taking the time to look at my specific issues.


As far as this series is concerned, and as far as it matters:

Acked-by: Phil Auld <pauld@xxxxxxxxxx>



Cheers,
Phil

--