Re: power-efficient scheduling design

From: Catalin Marinas
Date: Tue Jun 18 2013 - 15:06:34 EST

On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
> On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
> > Looking at the discussion it seems that people have slightly different
> > views, but most agree that the goal is an integrated scheduling,
> > frequency, and idle policy like you pointed out from the beginning.
> ... except that such a solution does not really work for Intel hardware.

I think it can work (see below).

> The OS does not get to really pick the CPU "frequency" (never mind that
> frequency is not what gets controlled), the hardware picks the frequency.
> The OS can do some level of requests (best to think of this as a percentage
> more than frequency) but what you actually get is more often than not
> what you asked for.

Morten's proposal does not try to "pick" a frequency. The P-state change
is still done gradually based on the load (so we still have an adaptive
loop). The load (total or per-task) can be tracked in an arch-specific
way (using aperf/mperf on x86).
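
As a rough illustration of the aperf/mperf tracking mentioned above, the
ratio over an interval can be derived from two snapshots of the counters.
This is only a sketch: the Sample type and the function name are made up
for the example, and reading the MSRs themselves is left out.

```python
from typing import NamedTuple

COUNTER_MASK = (1 << 64) - 1  # APERF and MPERF are 64-bit counters


class Sample(NamedTuple):
    """Hypothetical snapshot of the two counters for one CPU."""
    aperf: int  # advances at the delivered rate while the CPU is not idle
    mperf: int  # advances at a fixed reference rate while the CPU is not idle


def aperf_mperf_ratio(prev: Sample, cur: Sample) -> float:
    """Delivered performance relative to the reference rate over one interval."""
    # Masking keeps the deltas correct if a counter wrapped between samples.
    d_aperf = (cur.aperf - prev.aperf) & COUNTER_MASK
    d_mperf = (cur.mperf - prev.mperf) & COUNTER_MASK
    if d_mperf == 0:
        return 0.0  # the CPU did not run at all during the interval
    return d_aperf / d_mperf
```

A ratio below 1 means the CPU ran below the reference rate during the
interval; a ratio above 1 means it ran above it (e.g. with turbo boost).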

The difference from what intel_pstate.c does now is that the power
scheduler has a view of the total load (across all CPUs) and the
run-queue content. It can "guide" the load balancer into favouring one
or two CPUs and ignoring the rest (using cpu_power).

If several CPUs have a small aperf/mperf ratio, it can decide to use
fewer CPUs at a higher aperf/mperf by telling the load balancer not to
use the others (cpu_power = 1). All of this is continuously re-adjusted
to cope with changes in the load and hardware variations like turbo
boost.
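
The consolidation step could look something like the sketch below. It is
not taken from Morten's patches: the threshold, the nominal cpu_power
value of 1024 and the way the ratios are summed into a "demand" figure
are all assumptions made for the example.

```python
import math
from typing import Dict

CPU_POWER_DEFAULT = 1024  # assumed nominal cpu_power for a usable CPU
CPU_POWER_MIN = 1         # hint to the load balancer: do not use this CPU


def consolidate(ratios: Dict[int, float], low: float = 0.5) -> Dict[int, int]:
    """Return a cpu_power hint per CPU id.

    ratios maps CPU id -> recent aperf/mperf ratio; `low` is a
    hypothetical threshold below which a CPU counts as lightly used.
    """
    power = {cpu: CPU_POWER_DEFAULT for cpu in ratios}
    lightly_used = sorted(cpu for cpu, r in ratios.items() if r < low)
    if len(lightly_used) < 2:
        return power  # nothing to consolidate
    # Crude assumption: the summed ratios approximate how many CPUs
    # running at the reference rate the combined work would need.
    demand = sum(ratios[cpu] for cpu in lightly_used)
    keep = max(1, math.ceil(demand))
    for cpu in lightly_used[keep:]:
        power[cpu] = CPU_POWER_MIN
    return power
```

Calling this on every sampling interval gives the continuous
re-adjustment: a CPU marked with cpu_power = 1 gets its normal value back
as soon as the remaining CPUs no longer look lightly used.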

Similarly, if a CPU has aperf/mperf >= 1, it keeps increasing the
P-state (depending on the policy). Once it gets to the highest level,
depending on the number of threads in the run-queue (doesn't make sense
for only one), it can open up other CPUs and let the load balancer use
them.
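
The ramp-up side can be sketched in the same way. Again this is only an
illustration under assumptions: CpuState, the "level" numbering (higher
means a higher performance request) and the list of parked CPUs are
invented for the example and say nothing about how the real policy would
be structured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CpuState:
    """Hypothetical per-CPU view kept by the power scheduler."""
    level: int       # current performance request, 0 = lowest
    max_level: int   # highest request this sketch will make
    ratio: float     # recent aperf/mperf ratio
    nr_running: int  # runnable threads on this CPU's run-queue


def scale_up(cpu: CpuState, parked: List[int]) -> str:
    """Take one adaptive step for a busy CPU and report what was done."""
    if cpu.ratio < 1.0:
        return "hold"   # not saturated, nothing to do here
    if cpu.level < cpu.max_level:
        cpu.level += 1  # keep raising the request gradually
        return "raise"
    # Already at the highest level: another CPU only helps if there is
    # more than one runnable thread to spread.
    if cpu.nr_running > 1 and parked:
        parked.pop()    # hand one parked CPU back to the load balancer
        return "open"
    return "hold"
```

The point of the ordering is that opening another CPU is the last resort,
taken only after the P-state request has reached its top level.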

> You can look in hindsight what kind of performance you got (from some basic
> counters in MSRs), and the scheduler can use that to account backwards to what some process
> got. But to predict what you will get in the future...... that's near impossible
> on any realistic system nowadays (and even more so in the future).

We don't need absolute figures matching load to P-states but we'll
continue with an adaptive system. What we have now is also an adaptive
system but with independent decisions taken by the load balancer and the
P-state driver. The load balancer can even get confused by the cpufreq
decisions and move tasks around unnecessarily. With Morten's proposal we
get the power scheduler to adjust the P-state while giving hints to the
load balancer at the same time (it adjusts both; it doesn't try to
re-adjust itself after the load balancer).

> Treating "frequency" (well "performance) and idle separately is also a false thing to do
> (yes I know in 3.9/3.10 we still do that for Intel hw, but we're working
> on fixing that). They are by no means separate things. One guy's idle state
> is the other guys power budget (and thus performance)!.

I agree.
