Re: [PATCH 1/3] sched: remove select_idle_core() for scalability

From: Subhra Mazumdar
Date: Wed May 02 2018 - 17:56:39 EST




On 05/01/2018 11:03 AM, Peter Zijlstra wrote:
> On Mon, Apr 30, 2018 at 04:38:42PM -0700, Subhra Mazumdar wrote:
>> I also noticed a possible bug later in the merge code. Shouldn't it be:
>>
>> if (busy < best_busy) {
>>         best_busy = busy;
>>         best_cpu = first_idle;
>> }
> Uhh, quite. I did say it was completely untested, but yes.. /me dons the
> brown paper bag.
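
For readers following along, here is a minimal, self-contained sketch of
the selection that the corrected comparison implements: remember the first
idle CPU of whichever core has the fewest busy siblings. The merge code
itself is not shown in this thread, so struct core_scan, the loop, and the
function name are hypothetical stand-ins. Only the three names busy,
best_busy and first_idle come from the snippet above.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical per-core summary; this is NOT a kernel structure. */
struct core_scan {
	int busy;       /* number of busy SMT siblings in this core */
	int first_idle; /* first idle CPU seen in this core, -1 if none */
};

/*
 * Return the first idle CPU of the least-busy core, or -1 when no
 * core has an idle CPU. Ties keep the earliest core scanned.
 */
int pick_least_busy_idle_cpu(const struct core_scan *cores, int nr_cores)
{
	int best_busy = INT_MAX;
	int best_cpu = -1;

	assert(nr_cores >= 0);

	for (int i = 0; i < nr_cores; i++) {
		int busy = cores[i].busy;
		int first_idle = cores[i].first_idle;

		if (first_idle < 0)
			continue; /* fully busy core, nothing to pick */

		/* The comparison discussed above. */
		if (busy < best_busy) {
			best_busy = busy;
			best_cpu = first_idle;
		}
	}

	return best_cpu;
}
```

With cores summarised as {busy=2, none idle}, {busy=1, idle=5} and
{busy=0, idle=8}, the sketch returns CPU 8, the idle CPU on the fully idle
core.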
I re-ran the tests after fixing that bug but still see similar regressions
on hackbench and similar improvements on Uperf. I didn't re-run the
Oracle DB tests, but my guess is that they will show a similar improvement.

merge:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5131 (10.64%)   4.11
2       0.5776    7.87    0.5387 (6.73%)    2.39
4       0.9578    1.12    1.0549 (-10.14%)  0.85
8       1.7018    1.35    1.8516 (-8.8%)    1.56
16      2.9955    1.36    3.2466 (-8.38%)   0.42
32      5.4354    0.59    5.7738 (-6.23%)   0.38

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  %stdev  patch            %stdev
8        49.47     0.35    51.1 (3.29%)     0.13
16       95.28     0.77    98.45 (3.33%)    0.61
32       156.77    1.17    170.97 (9.06%)   5.62
48       193.24    0.22    245.89 (27.25%)  7.26
64       216.21    9.33    316.43 (46.35%)  0.37
128      379.62    10.29   337.85 (-11%)    3.68
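
As a reading aid for the tables: the percentages in parentheses appear to
be the change relative to baseline, signed so that a positive value is an
improvement (less time for hackbench, more throughput for Uperf). That
convention is inferred from the numbers and is not stated in the mail. The
helper names below are hypothetical.

```c
#include <assert.h>

/* Percent improvement when a lower result is better (hackbench time). */
double pct_gain_lower_is_better(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (baseline - patched) / baseline * 100.0;
}

/* Percent improvement when a higher result is better (Uperf throughput). */
double pct_gain_higher_is_better(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (patched - baseline) / baseline * 100.0;
}
```

For example, pct_gain_lower_is_better(0.5742, 0.5131) gives roughly 10.64,
matching the 1-group hackbench row, and
pct_gain_higher_is_better(216.21, 316.43) gives roughly 46.35, matching
the 64-thread Uperf row.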

I tried using the next_cpu technique with the merge, but it didn't help. I
am open to suggestions.

merge + next_cpu:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch             %stdev
1       0.5742    21.13   0.5107 (11.06%)   6.35
2       0.5776    7.87    0.5917 (-2.44%)   11.16
4       0.9578    1.12    1.0761 (-12.35%)  1.1
8       1.7018    1.35    1.8748 (-10.17%)  0.8
16      2.9955    1.36    3.2419 (-8.23%)   0.43
32      5.4354    0.59    5.6958 (-4.79%)   0.58

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  %stdev  patch            %stdev
8        49.47     0.35    51.65 (4.41%)    0.26
16       95.28     0.77    99.8 (4.75%)     1.1
32       156.77    1.17    168.37 (7.4%)    0.6
48       193.24    0.22    228.8 (18.4%)    1.75
64       216.21    9.33    287.11 (32.79%)  10.82
128      379.62    10.29   346.22 (-8.8%)   4.7

Finally, there was an earlier suggestion by Peter to transpose the cpu
offset in select_task_rq_fair. I had tried that earlier, but it also
regressed on hackbench. I just wanted to mention it so we have closure on that.

transpose cpu offset in select_task_rq_fair:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch            %stdev
1       0.5742    21.13   0.5251 (8.55%)   2.57
2       0.5776    7.87    0.5471 (5.28%)   11
4       0.9578    1.12    1.0148 (-5.95%)  1.97
8       1.7018    1.35    1.798 (-5.65%)   0.97
16      2.9955    1.36    3.088 (-3.09%)   2.7
32      5.4354    0.59    5.2815 (2.8%)    1.26