Re: [PATCH] cpuidle: menu: use nr_running instead of cpuload forcalculating perf mult

From: Peter Zijlstra
Date: Mon Jun 04 2012 - 13:16:19 EST


On Mon, 2012-06-04 at 09:53 -0700, Arjan van de Ven wrote:
> >
> > False, you can have 0 idle time and still have low load.
>
> 1 is not low in this context fwiw.

I think you're mis-understanding the load number you're using. I suspect
you're expecting something like the load-avg top/uptime provide. You're
very much not using anything similar.

Nor do we compute anything like that, and I want to avoid having to
compute anything like that because its expensive.

> >> but because idle
> >> time tends to be bursty, we can still be idle for, say, a millisecond
> >> every 10 milliseconds. In this scenario, the load average is used to
> >> ensure that the 200 usecond cost of exiting idle is acceptable.
> >
> > So what you're saying is that if you have 1ms idle in 10ms, it might not
> > be a continuous 1ms. And you're using load as a measure of how many
> > fragments it comes apart in?
>
> no
>
> what I'm saying is that if you have a workload where you have 10 msec of
> work, then 1 msec of idle, then 10 msec of work, 1 msec of idle etc etc,
> it is very different from 100 msec of work, 10 msec of idle, 100 msec of
> work, even though utilization is the same.

Sure..

> what the logic is trying to do, on a 10 km level, is to limit the damage
> of accumulated C state exit time.
> (I'll avoid the word "latency" here, since the real time people will
> then immediately think this is about controlling latency response, which
> it isn't)

But why? There's a natural limit to his, say the wakeup costs 0.2ms then
you can only do 5k of those a second. Once you need to actually do some
work as well this comes down.

But its all idle time, you cannot be idle longer than there is a lack of
work. So if you're idle too long (because of long exit latency) your
work shifts and the future idle time reduces, eventually causing a lower
C state to be used.

Also, when you notice you're waking up too soon, you can quickly ramp
down on the C state levels.

> Now, if you're very idle for a sustained duration (e.g. low load),
> you're assumed not sensitive to a bit of performance cost.
> but if you're actually busy (over a longer period, not just "right
> now"), you're assumed to be sensitive to the performance cost,
> and what the algorithm does is make it less easy to go into the
> expensive states.

My brain still sparks and fizzles when I read that.. it just doesn't
compute.

What performance? performance isn't a well defined word.

> the closest metric we have right now to "sensitive to performance cost"
> that I know of is "load average". If the scheduler has a better metric,
> I'd be more than happy to switch the idle selection code over to it...

I can't suggest anything better for something I've still no clue about.
You're completely failing to explain this thing to me.

> note that the idle selection code has 3 metrics, this is only one of them:
> 1. PM_QOS latency tolerance
> 2. Energy break even
> 3. Performance tolerance

That 3rd, I'm completely failing to understand.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/