Re: sched: tweak select_idle_sibling to look for idle threads

From: Chris Mason
Date: Tue Apr 12 2016 - 09:28:40 EST


On Tue, Apr 12, 2016 at 06:44:08AM +0200, Mike Galbraith wrote:
> On Mon, 2016-04-11 at 20:30 -0400, Chris Mason wrote:
> > On Mon, Apr 11, 2016 at 06:54:21AM +0200, Mike Galbraith wrote:
>
> > > > Ok, I was able to reproduce this by stuffing tbench_srv and tbench onto
> > > > just socket 0. Version 2 below fixes things for me, but I'm hoping
> > > > someone can suggest a way to get task_hot() buddy checks without the rq
> > > > lock.
> > > >
> > > > I haven't run this on production loads yet, but our 4.0 patch for this
> > > > uses task_hot(), so I'd expect it to be on par. If this doesn't fix it
> > > > for you, I'll dig up a similar machine on Monday.
> > >
> > > My box stopped caring. I personally would be reluctant to apply it
> > > without a "you asked for it" button or a large pile of benchmark
> > > results. Lock banging or not, full scan existing makes me nervous.
> >
> >
> > We can use a bitmap at the socket level to keep track of which cpus are
> > idle. I'm sure there are better places for the array and better ways to
> > allocate, this is just a rough cut to make sure the idle tracking works.
>
> See e0a79f529d5b:
>
> pre 15.22 MB/sec 1 procs
> post 252.01 MB/sec 1 procs
>
> You can make traverse cycles go away, but those cycles, while precious,
> are not the most costly cycles. The above was 1 tbench pair in an
> otherwise idle box.. ie it wasn't traverse cycles that demolished it.

Agreed, this is why the decision not to scan is so important. But while
I've been describing this patch in terms of latency, latency is really
a symptom rather than the goal. Without these patches, workloads that
do want to fully utilize the hardware are basically getting one fewer
core of utilization. It's true that we define 'fully utilize' with an
upper bound on application response time, but we're not talking high
frequency trading here.

It clearly shows up in our graphs. CPU idle is higher (the lost core),
CPU user time is lower, and average system load is higher (procs waiting
on fewer cores).

We measure this internally with scheduling latency because that's the
easiest way to talk about it across a wide variety of hardware.

>
> -Mike
>
> (p.s. SCHED_IDLE is dinky bandwidth fair class)

Ugh, not my best quick patch, but you get the idea I was going for. I
can always add the tunable to flip things on/off but I'd prefer that we
find a good set of defaults, mostly so the FB production runtime is the
common config instead of the special snowflake.
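
For anyone following along, the socket-level idle bitmap idea quoted
above could look roughly like the userspace sketch below. This is only
an illustration, not the code from the patch: the struct, the function
names and the 64-cpus-per-socket limit are all hypothetical, and it uses
C11 atomics plus the GCC/Clang __builtin_ctzll builtin where the kernel
would use its own cpumask helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: one 64-bit word is enough to cover every cpu in a socket. */
#define SOCKET_MAX_CPUS 64

/* Hypothetical per-socket state; not the layout used by the real patch. */
struct socket_idle_map {
	_Atomic uint64_t idle_mask;	/* bit n set => cpu n is idle */
};

/* Called when a cpu enters idle: set its bit without taking any lock. */
static void socket_mark_idle(struct socket_idle_map *m, unsigned int cpu)
{
	assert(cpu < SOCKET_MAX_CPUS);
	atomic_fetch_or(&m->idle_mask, UINT64_C(1) << cpu);
}

/* Called when a cpu leaves idle: clear its bit. */
static void socket_mark_busy(struct socket_idle_map *m, unsigned int cpu)
{
	assert(cpu < SOCKET_MAX_CPUS);
	atomic_fetch_and(&m->idle_mask, ~(UINT64_C(1) << cpu));
}

/*
 * Return the lowest-numbered idle cpu, or -1 if none is idle.
 * The mask is only a snapshot, so the answer can be stale by the
 * time the caller acts on it.
 */
static int socket_find_idle(struct socket_idle_map *m)
{
	uint64_t mask = atomic_load(&m->idle_mask);

	if (mask == 0)
		return -1;
	return __builtin_ctzll(mask);
}
```

The point of the sketch is that the wakeup path reads one word instead
of walking every cpu in the socket. It does nothing about the cost of
deciding whether to look at all.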

-chris