Re: [RFCv2 PATCH 01/23] sched: Documentation for scheduler energy cost model

From: Rafael J. Wysocki
Date: Thu Jul 24 2014 - 10:10:10 EST


On Thursday, July 24, 2014 09:26:09 AM Peter Zijlstra wrote:
> On Thu, Jul 24, 2014 at 02:53:20AM +0200, Rafael J. Wysocki wrote:
> > I am used to slightly different terminology here. Namely, there are voltage
> > domains (parts sharing a voltage rail or a voltage regulator, such that you
> > can only apply/remove/change voltage to all of them at the same time) and clock
> > domains (analogously, but for clocks). A power domain (which in your description
> > above seems to correspond to a voltage domain) may be a voltage domain, a clock
> > domain or a combination thereof.
> >
> > In addition to that, in a voltage domain it may be possible to apply many
> > different levels of voltage, which case doesn't seem to be covered at all by
> > the above (or I'm missing something).
> >
> > Also a P-state is not just a frequency level, but a combination of frequency
> > and voltage that has to be applied for that frequency to be stable. You may
> > regard them as Operation Performance Points of the CPU, but that very well may
> > go beyond frequencies and voltages. Thus it actually is better not to talk
> > about P-states as "frequencies".
> >
> > Now, P-states may or may not have to be coordinated between all CPUs in a
> > package (cluster), by hardware or software, such that all CPUs in a cluster
> > need to be kept in the same P-state. That you can regard as a "P-state
> > domain", but it usually means a specific combination of voltage and frequency.
>
> I think Morton is aware of this, but for the sake of sanity dropped the
> whole lot into something simpler (while hoping reality would not ruin
> his life).
>
> > C-states in turn are states in which CPUs don't execute instructions.
> > That need not mean the removal of voltage or even frequency from them.
> > Of course, they do mean some sort of power draw reduction, but that may
> > be achieved in many different ways. Some C-states require coordination
> > too (for example, a single C-state may apply to a whole package or cluster
> > at the same time) and you can think about "domains" here too, but there
> > need not be a direct mapping to physical parameters such as the frequency
> > or the voltage.
>
> One thing that wasn't clear to me is if you allow for C-domain and
> P-domain to overlap or if they're always inclusive (where one is wholly
> contained in the other).

On the CPUs I worked with so far they were always inclusive. Previously, the
whole package was a P-state domain. Today some CPUs (Haswell server chips
for example) have per-core P-states.

> > Moreover, P-states and C-states may overlap. That is, a CPU may be in Px
> > and Cy at the same time, which means that after leaving Cy it will execute
> > instructions in Px. Things like leakage may depend on x in that case and
> > the total power draw may depend on the combination of x and y.
>
> Right, and I suppose the domain thing makes it impossible to drop to the
> lowest P state on going idle. Tricky that.

That's the case for older chips. I'm not sure about the newest lot entirely
to be honest, need to ask.

> > The concern is that if a scaling governor is running in parallel with the above
> > algorithm and it has its own utilization goal (it usually does), it may change
> > the P-state under you to match that utilization goal and you'll end up with
> > something different from what you expected.
> >
> > That may be addressed either by trying to predict what the scaling governor will
> > do (and good luck with that) or by taking care of P-states by yourself. The
> > latter would require changes to the algorithm I think, though.
>
> The idea was that we'll do P states ourselves based on these utilization
> figures. If we find we cannot fit the 'new' task into the current set
> without either raising P or waking an idle cpu (if at all available), we
> compute the cost of either option and pick the cheapest.

Yeah. One subtle thing is that ramping up P may affect the other guys
(if the whole chip is a P-domain, for example), but I guess that can be
taken into account.

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/