Re: sched: tweak select_idle_sibling to look for idle threads
From: Mike Galbraith
Date: Mon May 02 2016 - 10:50:22 EST
On Mon, 2016-05-02 at 10:46 +0200, Peter Zijlstra wrote:
> On Sun, May 01, 2016 at 09:12:33AM +0200, Mike Galbraith wrote:
>
> > Nah, tbench is just variance prone. It got dinged up at clients=cores
> > on my desktop box, on 4 sockets the high end got seriously dinged up.
>
>
> Ha!, check this:
>
> root@ivb-ep:~# echo OLD_IDLE > /debug/sched_features ; echo
> NO_ORDER_IDLE > /debug/sched_features ; echo IDLE_CORE >
> /debug/sched_features ; echo NO_FORCE_CORE > /debug/sched_features ;
> tbench 20 -t 10
>
> Throughput 5956.32 MB/sec 20 clients 20 procs max_latency=0.126 ms
>
>
> root@ivb-ep:~# echo OLD_IDLE > /debug/sched_features ; echo ORDER_IDLE >
> /debug/sched_features ; echo IDLE_CORE > /debug/sched_features ; echo
> NO_FORCE_CORE > /debug/sched_features ; tbench 20 -t 10
>
> Throughput 5011.86 MB/sec 20 clients 20 procs max_latency=0.116 ms
>
>
>
> That little ORDER_IDLE thing hurts silly. That's a little patch I had
> lying about because some people complained that tasks hop around the
> cache domain, instead of being stuck to a CPU.
>
> I suspect what happens is that by all CPUs starting to look for idle at
> the same place (the first cpu in the domain) they all find the same idle
> cpu and things pile up.
>
> The old behaviour, where they all start iterating from where they were
> avoids some of that, at the cost of making tasks hop around.
>
> Lets see if I can get the same behaviour out of the cpumask iteration
> code..
Order is one thing, but what the old behavior does first and foremost,
once the box starts getting really busy, is shut select_idle_sibling()
down by only looking at target's sibling, instead of letting it wreck
things. Once cores are moving, there are no large piles of anything
left to collect other than pain.
We really need a good way to know we're not gonna turn the box into a
shredder. The wake_wide() thing might help some, though it likely wants
some twiddling; in_interrupt() might be another time to try hard.
Anyway, the has_idle_cores business seems to shut select_idle_sibling()
down rather nicely when the box gets busy. Forcing either core,
target's sibling or go fish turned in a top end win on 48 rq/socket.
Oh btw, did you know single socket boxen have no sd_busy? That doesn't
look right.
fromm:~/:[0]# for i in 1 2 4 8 16 32 64 128 256; do tbench.sh $i 30 2>&1| grep Throughput; done
Throughput 511.016 MB/sec 1 clients 1 procs max_latency=0.113 ms
Throughput 1042.03 MB/sec 2 clients 2 procs max_latency=0.098 ms
Throughput 1953.12 MB/sec 4 clients 4 procs max_latency=0.236 ms
Throughput 3694.99 MB/sec 8 clients 8 procs max_latency=0.308 ms
Throughput 7080.95 MB/sec 16 clients 16 procs max_latency=0.442 ms
Throughput 13444.7 MB/sec 32 clients 32 procs max_latency=1.417 ms
Throughput 20191.3 MB/sec 64 clients 64 procs max_latency=4.554 ms
Throughput 41115.4 MB/sec 128 clients 128 procs max_latency=13.414 ms
Throughput 66844.4 MB/sec 256 clients 256 procs max_latency=50.069 ms
	/*
	 * If there are idle cores to be had, go find one.
	 */
	if (sched_feat(IDLE_CORE) && test_idle_cores(target)) {
		i = select_idle_core(p, target);
		if ((unsigned)i < nr_cpumask_bits)
			return i;

		/*
		 * Failed to find an idle core; stop looking for one.
		 */
		clear_idle_cores(target);
	}
#if 1
	for_each_cpu(i, cpu_smt_mask(target)) {
		if (idle_cpu(i))
			return i;
	}

	return target;
#endif

	if (sched_feat(FORCE_CORE)) {