Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

From: Josef Bacik
Date: Mon Jul 06 2015 - 15:41:44 EST


On 07/06/2015 02:36 PM, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote:
>> On 07/06/2015 01:13 AM, Mike Galbraith wrote:
>>> Hm. Piddling with pgbench, which doesn't seem to collapse into a
>>> quivering heap when load exceeds cores these days, deltas weren't all
>>> that impressive, but it does appreciate the extra effort a bit, and a
>>> bit more when clients receive it as well.
>>>
>>> If you test, and have time to piddle, you could try letting wake_wide()
>>> return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
>>> the dispatcher.
>>>
>>> Numbers from my little desktop box.
>>>
>>> NO_WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8 tps = 116697.697662
>>> clients 12 tps = 115160.230523
>>> clients 16 tps = 115569.804548
>>> clients 20 tps = 117879.230514
>>> clients 24 tps = 118281.753040
>>> clients 28 tps = 116974.796627
>>> clients 32 tps = 119082.163998 avg 117092.239 1.000
>>>
>>> WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8 tps = 124351.735754
>>> clients 12 tps = 124419.673135
>>> clients 16 tps = 125050.716498
>>> clients 20 tps = 124813.042352
>>> clients 24 tps = 126047.442307
>>> clients 28 tps = 125373.719401
>>> clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000
>>>
>>> WAKE_WIDE_IDLE (clients as well as server)
>>> postgres@homer:~> pgbench.sh
>>> clients 8 tps = 130539.795246
>>> clients 12 tps = 128984.648554
>>> clients 16 tps = 130564.386447
>>> clients 20 tps = 129149.693118
>>> clients 24 tps = 130211.119780
>>> clients 28 tps = 130325.355433
>>> clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037

> I had a typo in my script, so those desktop box numbers were all doing
> the same number of clients. It doesn't invalidate anything, but the
> individual deltas are just run-to-run variance... not to mention that a
> single-cache box is not all that interesting for this anyway. That
> happens when the interconnect becomes a player.

>> I have time for twiddling; we're carrying ye olde WAKE_IDLE until we
>> get this solved upstream, and then I'll rip out the old and put in the
>> new. I'm happy to screw around until we're all happy. I'll throw this
>> into a kernel this morning and run stuff today. Barring any issues
>> with the testing infrastructure, I should have results today. Thanks,

> I'll be interested in your results. Taking pgbench to a little NUMA
> box, I'm seeing _nada_ outside of variance with master (crap). I have
> a way to win significantly for _older_ kernels, and that win over
> master _may_ provide some useful insight, but I don't trust
> postgres/pgbench as far as I can toss the planet, so I don't have a
> warm fuzzy about trying to use it to approximate your real-world load.
>
> BTW, what does your topology look like (numactl --hardware)?


So the NO_WAKE_WIDE_IDLE results are very good: almost the same as the
baseline, with a slight regression at lower RPS and a slight improvement
at high RPS. I'm running with WAKE_WIDE_IDLE set now; that should be
done soonish, and then I'll do the 1 + sched_feat(WAKE_WIDE_IDLE) thing
next, so those results should come in the morning (a sketch of that
tweak is below the topology dump).

Here is the NUMA information from one of the boxes in the test cluster:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 15890 MB
node 0 free: 2651 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 16125 MB
node 1 free: 2063 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
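
For reference, the wake_wide() tweak I'll be testing next looks roughly
like the below. This is only my sketch of the current wake_wide() in
kernel/sched/fair.c with Mike's suggestion folded in, and it assumes a
WAKE_WIDE_IDLE entry in kernel/sched/features.h; it's not the actual
patch from this thread:

static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = this_cpu_read(sd_llc_size);

	if (master < slave)
		swap(master, slave);
	if (slave < factor || master < slave * factor)
		return 0;
	/*
	 * Instead of returning 2 only when the wakee is the
	 * dispatcher, return 1 + sched_feat(WAKE_WIDE_IDLE) for
	 * every wide wakeup: 1 = wake wide, 2 = wake wide and
	 * also go looking for a truly idle cpu.
	 */
	return 1 + sched_feat(WAKE_WIDE_IDLE);
}

As I read it, the caller would treat the 2 as "also prefer an idle cpu
over an idle sibling", which is the behavior the original BALANCE_WAKE
patch was after.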

Thanks,

Josef
