Re: sched: tweak select_idle_sibling to look for idle threads
From: Mike Galbraith
Date: Wed May 11 2016 - 00:17:59 EST
On Wed, 2016-05-11 at 03:16 +0800, Yuyang Du wrote:
> On Tue, May 10, 2016 at 05:26:05PM +0200, Mike Galbraith wrote:
> > On Tue, 2016-05-10 at 09:49 +0200, Mike Galbraith wrote:
> >
> > > Only whacking
> > > cfs_rq_runnable_load_avg() with a rock makes schbench -m -t
> > > -a work well. 'Course a rock in its gearbox also
> > > rendered load balancing fairly busted for the general case :)
> >
> > Smaller rock doesn't injure heavy tbench, but more importantly, still
> > demonstrates the issue when you want full spread.
> >
> > schbench -m4 -t38 -a
> >
> > cputime 30000 threads 38 p99 177
> > cputime 30000 threads 39 p99 10160
> >
> > LB_TIP_AVG_HIGH
> > cputime 30000 threads 38 p99 193
> > cputime 30000 threads 39 p99 184
> > cputime 30000 threads 40 p99 203
> > cputime 30000 threads 41 p99 202
> > cputime 30000 threads 42 p99 205
> > cputime 30000 threads 43 p99 218
> > cputime 30000 threads 44 p99 237
> > cputime 30000 threads 45 p99 245
> > cputime 30000 threads 46 p99 262
> > cputime 30000 threads 47 p99 296
> > cputime 30000 threads 48 p99 3308
> >
> > 47*4+4=nr_cpus yay
>
> yay... and haha, "a perfect world"...
Yup.. for this load.
> > ---
> > kernel/sched/fair.c | 3 +++
> > kernel/sched/features.h | 1 +
> > 2 files changed, 4 insertions(+)
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3027,6 +3027,9 @@ void remove_entity_load_avg(struct sched
> >
> > static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
> > {
> > +	if (sched_feat(LB_TIP_AVG_HIGH) && cfs_rq->load.weight > cfs_rq->runnable_load_avg*2)
> > +		return cfs_rq->runnable_load_avg + min_t(unsigned long, NICE_0_LOAD,
> > +							  cfs_rq->load.weight/2);
> >  	return cfs_rq->runnable_load_avg;
> > }
>
> cfs_rq->runnable_load_avg is for sure no greater than (in this case much less
> than, maybe 1/2 of) load.weight, whereas load_avg is not necessarily a rock
> in gearbox that only impedes speed up, but also speed down.
Yeah, just like everything else, it cuts both ways (why you can't
win the sched game). If I can believe tbench, at tasks=cpus, reducing
lag increased utilization and reduced latency a wee bit, as did the
reserve thing once a booboo got fixed up. Makes sense, robbing Peter
to pay Paul should work out better for Paul.
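For anyone who wants to poke at the arithmetic outside the kernel, here is a
minimal user-space sketch of what the LB_TIP_AVG_HIGH hunk quoted above
computes. It is not the kernel code: tipped_load() and min_ul() are
hypothetical stand-ins for cfs_rq_runnable_load_avg() and min_t(), the
feature bit is passed in as a plain flag, and NICE_0_LOAD is assumed to be
1024, which depends on the kernel's load resolution.

```c
/* Assumption: NICE_0_LOAD is 1024; the real value depends on kernel config. */
#define NICE_0_LOAD_ASSUMED 1024UL

/* Hypothetical stand-in for the kernel's min_t(unsigned long, a, b). */
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical user-space model of the patched cfs_rq_runnable_load_avg().
 * weight       models cfs_rq->load.weight (instantaneous queued weight)
 * runnable_avg models cfs_rq->runnable_load_avg (lagging average)
 * feat_on      models sched_feat(LB_TIP_AVG_HIGH)
 */
static unsigned long tipped_load(unsigned long weight,
				 unsigned long runnable_avg, int feat_on)
{
	/* Average lags the queued weight by more than 2x: tip it upward. */
	if (feat_on && weight > runnable_avg * 2)
		return runnable_avg + min_ul(NICE_0_LOAD_ASSUMED, weight / 2);

	/* Otherwise report the average unchanged. */
	return runnable_avg;
}
```

With one nice-0 task queued (weight 1024) and an average that has only
ramped to 300, the model reports 812 instead of 300, so the queue looks
busier sooner. Once the average is at least half the weight, the tip no
longer applies.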
NO_LB_TIP_AVG_HIGH
Throughput 27132.9 MB/sec 96 clients 96 procs max_latency=7.656 ms
Throughput 28464.1 MB/sec 96 clients 96 procs max_latency=9.905 ms
Throughput 25369.8 MB/sec 96 clients 96 procs max_latency=7.192 ms
Throughput 25670.3 MB/sec 96 clients 96 procs max_latency=5.874 ms
Throughput 29309.3 MB/sec 96 clients 96 procs max_latency=1.331 ms
avg 27189 1.000 6.391 1.000
NO_LB_TIP_AVG_HIGH IDLE_RESERVE
Throughput 24437.5 MB/sec 96 clients 96 procs max_latency=1.837 ms
Throughput 29464.7 MB/sec 96 clients 96 procs max_latency=1.594 ms
Throughput 28023.6 MB/sec 96 clients 96 procs max_latency=1.494 ms
Throughput 28299.0 MB/sec 96 clients 96 procs max_latency=10.404 ms
Throughput 29072.1 MB/sec 96 clients 96 procs max_latency=5.575 ms
avg 27859 1.024 4.180 0.654
LB_TIP_AVG_HIGH NO_IDLE_RESERVE
Throughput 29068.1 MB/sec 96 clients 96 procs max_latency=5.599 ms
Throughput 26435.6 MB/sec 96 clients 96 procs max_latency=3.703 ms
Throughput 23930.0 MB/sec 96 clients 96 procs max_latency=7.742 ms
Throughput 29464.2 MB/sec 96 clients 96 procs max_latency=1.549 ms
Throughput 24250.9 MB/sec 96 clients 96 procs max_latency=1.518 ms
avg 26629 0.979 4.022 0.629
LB_TIP_AVG_HIGH IDLE_RESERVE
Throughput 30340.1 MB/sec 96 clients 96 procs max_latency=1.465 ms
Throughput 29042.9 MB/sec 96 clients 96 procs max_latency=4.515 ms
Throughput 26718.7 MB/sec 96 clients 96 procs max_latency=1.822 ms
Throughput 28694.4 MB/sec 96 clients 96 procs max_latency=1.503 ms
Throughput 28918.2 MB/sec 96 clients 96 procs max_latency=7.599 ms
avg 28742 1.057 3.380 0.528
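The avg lines read as: mean throughput, its ratio to the NO_LB_TIP_AVG_HIGH
baseline, mean max_latency, and its ratio to the baseline. A small sketch of
that arithmetic follows, using the baseline and the LB_TIP_AVG_HIGH
IDLE_RESERVE runs above. The helper names mean() and ratio_to_baseline() are
made up for illustration, and the last printed digit can differ from the
figures above depending on rounding versus truncation.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples; n must be non-zero. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Mean of one configuration divided by the mean of the baseline. */
static double ratio_to_baseline(const double *v, const double *base, size_t n)
{
	return mean(v, n) / mean(base, n);
}

/* NO_LB_TIP_AVG_HIGH: throughput (MB/sec) and max_latency (ms). */
static const double base_tput[5] = { 27132.9, 28464.1, 25369.8, 25670.3, 29309.3 };
static const double base_lat[5]  = { 7.656, 9.905, 7.192, 5.874, 1.331 };

/* LB_TIP_AVG_HIGH IDLE_RESERVE: throughput (MB/sec) and max_latency (ms). */
static const double both_tput[5] = { 30340.1, 29042.9, 26718.7, 28694.4, 28918.2 };
static const double both_lat[5]  = { 1.465, 4.515, 1.822, 1.503, 7.599 };
```

Calling mean(both_tput, 5) gives roughly 28742.9 and
ratio_to_baseline(both_lat, base_lat, 5) gives roughly 0.529, in line with
the last avg row. With only five runs per configuration and max_latency
swinging between about 1.3 and 10.4 ms, the ratios are indicative at best.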
> But I really don't know what kind of load the references in select_task_rq()
> should use. So maybe the real issue is a mix of them, i.e., conflated balancing
> and just wanting an idle cpu?
Depends on the goal. For both, load lagging reality means the high
frequency component is squelched, meaning less migration cost, but also
higher latency due to stacking. It's a tradeoff where Chris' "latency
is everything" benchmark, and _maybe_ the real world load it's based
upon is on Peter's end of the rob Peter to pay Paul transaction. The
benchmark says it definitely is, the real world load may have already
been fixed up by the select_idle_sibling() rewrite.
-Mike