Re: sched: tweak select_idle_sibling to look for idle threads
From: Matt Fleming
Date: Thu May 05 2016 - 18:03:14 EST
On Wed, 04 May, at 12:37:01PM, Peter Zijlstra wrote:
>
> tbench wants select_idle_siblings() to just not exist; it goes happy
> when you just return target.
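(Just to make the comparison point explicit: the degenerate behaviour
described there would amount to select_idle_sibling() trusting the
wake-affine target and skipping the idle-core/idle-cpu search entirely,
i.e. roughly

	static int select_idle_sibling(struct task_struct *p, int target)
	{
		return target;	/* no idle search at all */
	}

though that's only an illustration of the quote, not anything from the
patch.)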
I've been playing with this patch a little bit by hitting it with
tbench on a 2-socket Xeon, 12 cores per socket with HT enabled (48 cpus).
I see a throughput improvement for 16, 32, 64, 128 and 256 clients
when compared against mainline, i.e. the mainline feature settings,
OLD_IDLE, ORDER_IDLE, NO_IDLE_CORE, NO_IDLE_CPU, NO_IDLE_SMT
vs.
NO_OLD_IDLE, NO_ORDER_IDLE, IDLE_CORE, IDLE_CPU, IDLE_SMT
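For reference, a rough userspace sketch of flipping to the NEW set at
runtime, assuming the patch exposes these as ordinary sched_feat bits
under /sys/kernel/debug/sched_features (CONFIG_SCHED_DEBUG); the flag
names are just the ones listed above:

	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>

	static const char *new_flags[] = {
		"NO_OLD_IDLE", "NO_ORDER_IDLE", "IDLE_CORE", "IDLE_CPU", "IDLE_SMT",
	};

	int main(void)
	{
		size_t i;

		for (i = 0; i < sizeof(new_flags) / sizeof(new_flags[0]); i++) {
			/* the sched_features handler wants one name per write() */
			int fd = open("/sys/kernel/debug/sched_features", O_WRONLY);

			if (fd < 0) {
				perror("sched_features");
				return 1;
			}
			if (write(fd, new_flags[i], strlen(new_flags[i])) < 0)
				perror(new_flags[i]);
			close(fd);
		}
		return 0;
	}

Writing the OLD names listed above restores the mainline settings the
same way.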
See,
[OLD] Throughput 5345.6 MB/sec 16 clients 16 procs max_latency=0.277 ms avg_latency=0.211853 ms
[NEW] Throughput 5514.52 MB/sec 16 clients 16 procs max_latency=0.493 ms avg_latency=0.176441 ms
[OLD] Throughput 7401.76 MB/sec 32 clients 32 procs max_latency=1.804 ms avg_latency=0.451147 ms
[NEW] Throughput 10044.9 MB/sec 32 clients 32 procs max_latency=3.421 ms avg_latency=0.582529 ms
[OLD] Throughput 13265.9 MB/sec 64 clients 64 procs max_latency=7.395 ms avg_latency=0.927147 ms
[NEW] Throughput 13929.6 MB/sec 64 clients 64 procs max_latency=7.022 ms avg_latency=1.017059 ms
[OLD] Throughput 12827.8 MB/sec 128 clients 128 procs max_latency=16.256 ms avg_latency=2.763706 ms
[NEW] Throughput 13364.2 MB/sec 128 clients 128 procs max_latency=16.630 ms avg_latency=3.002971 ms
[OLD] Throughput 12653.1 MB/sec 256 clients 256 procs max_latency=44.722 ms avg_latency=5.741647 ms
[NEW] Throughput 12965.7 MB/sec 256 clients 256 procs max_latency=59.061 ms avg_latency=8.699118 ms
For 1, 2, 4 and 8 clients the throughput results are more of a mixed
bag, with the old config sometimes winning and sometimes losing.
[OLD] Throughput 488.819 MB/sec 1 clients 1 procs max_latency=0.191 ms avg_latency=0.058794 ms
[NEW] Throughput 486.106 MB/sec 1 clients 1 procs max_latency=0.085 ms avg_latency=0.045794 ms
[OLD] Throughput 925.987 MB/sec 2 clients 2 procs max_latency=0.201 ms avg_latency=0.090882 ms
[NEW] Throughput 954.944 MB/sec 2 clients 2 procs max_latency=0.199 ms avg_latency=0.064294 ms
[OLD] Throughput 1764.02 MB/sec 4 clients 4 procs max_latency=0.160 ms avg_latency=0.075206 ms
[NEW] Throughput 1756.8 MB/sec 4 clients 4 procs max_latency=0.105 ms avg_latency=0.062382 ms
[OLD] Throughput 3384.22 MB/sec 8 clients 8 procs max_latency=0.276 ms avg_latency=0.099441 ms
[NEW] Throughput 3375.47 MB/sec 8 clients 8 procs max_latency=0.103 ms avg_latency=0.064176 ms
Looking at latency, the new code consistently performs worse at the
top end for 256 clients. Admittedly at that point the machine is
pretty overloaded. Things are much better at the lower end.
One thing I haven't yet done is twiddle the bits individually to see
which combination works best. Have you settled on the right settings
yet?