Re: [RFCv2 PATCH 01/23] sched: Documentation for scheduler energy cost model

From: Peter Zijlstra
Date: Thu Jul 24 2014 - 03:26:37 EST

On Thu, Jul 24, 2014 at 02:53:20AM +0200, Rafael J. Wysocki wrote:
> I am used to slightly different terminology here. Namely, there are voltage
> domains (parts sharing a voltage rail or a voltage regulator, such that you
> can only apply/remove/change voltage to all of them at the same time) and clock
> domains (analogously, but for clocks). A power domain (which in your description
> above seems to correspond to a voltage domain) may be a voltage domain, a clock
> domain or a combination thereof.
> In addition to that, in a voltage domain it may be possible to apply many
> different levels of voltage, which case doesn't seem to be covered at all by
> the above (or I'm missing something).
> Also a P-state is not just a frequency level, but a combination of frequency
> and voltage that has to be applied for that frequency to be stable. You may
> regard them as Operation Performance Points of the CPU, but that very well may
> go beyond frequencies and voltages. Thus it actually is better not to talk
> about P-states as "frequencies".
> Now, P-states may or may not have to be coordinated between all CPUs in a
> package (cluster), by hardware or software, such that all CPUs in a cluster
> need to be kept in the same P-state. That you can regard as a "P-state
> domain", but it usually means a specific combination of voltage and frequency.

I think Morton is aware of this, but for the sake of sanity dropped the
whole lot into something simpler (while hoping reality would not ruin
his life).

> C-states in turn are states in which CPUs don't execute instructions.
> That need not mean the removal of voltage or even frequency from them.
> Of course, they do mean some sort of power draw reduction, but that may
> be achieved in many different ways. Some C-states require coordination
> too (for example, a single C-state may apply to a whole package or cluster
> at the same time) and you can think about "domains" here too, but there
> need not be a direct mapping to physical parameters such as the frequency
> or the voltage.

One thing that wasn't clear to me is if you allow for C-domain and
P-domain to overlap or if they're always inclusive (where one is wholly
contained in the other).

> Moreover, P-states and C-states may overlap. That is, a CPU may be in Px
> and Cy at the same time, which means that after leaving Cy it will execute
> instructions in Px. Things like leakage may depend on x in that case and
> the total power draw may depend on the combination of x and y.

Right, and I suppose the domain thing makes it impossible to drop to the
lowest P state on going idle. Tricky that.

> The concern is that if a scaling governor is running in parallel with the above
> algorithm and it has its own utilization goal (it usually does), it may change
> the P-state under you to match that utilization goal and you'll end up with
> something different from what you expected.
> That may be addressed either by trying to predict what the scaling governor will
> do (and good luck with that) or by taking care of P-states by yourself. The
> latter would require changes to the algorithm I think, though.

The idea was that we'll do P states ourselves based on these utilization
figures. If we find we cannot fit the 'new' task into the current set
without either raising P or waking an idle cpu (if at all available), we
compute the cost of either option and pick the cheapest.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at