Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

From: Rafael J. Wysocki
Date: Mon Jul 27 2020 - 13:23:40 EST


On Wednesday, July 22, 2020 1:14:42 AM CEST Francisco Jerez wrote:
>
> Srinivas Pandruvada <srinivas.pandruvada@xxxxxxxxxxxxxxx> writes:
>
> > On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote:
> >> "Rafael J. Wysocki" <rafael@xxxxxxxxxx> writes:
> >>
> >> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <
> >> > currojerez@xxxxxxxxxx> wrote:
> >> > > "Rafael J. Wysocki" <rafael@xxxxxxxxxx> writes:
> >> > >
> > [...]
> >
> >> > Overall, so far, I'm seeing a claim that the CPU subsystem can be
> >> > made to use less energy and do as much work as before (which is
> >> > what improving the energy-efficiency means in general) if the
> >> > maximum frequency of CPUs is limited in a clever way.
> >> >
> >> > I'm failing to see what that clever way is, though.
> >> Hopefully the clarifications above help some.
> >
> > To simplify:
> >
> > Suppose I call numpy.multiply() to multiply two big arrays and the
> > thread is pegged to a CPU. Say the CPU finishes the job in 10 ms at
> > a P-state of 0x20, but the same job could have been done in 10 ms
> > at a P-state of 0x16. So we are not energy-efficient. To really
> > know where the bottleneck is there are a number of perf counters;
> > maybe the cache was the issue and we could rather raise the uncore
> > frequency a little. The APERF and MPERF counters alone are not
> > enough.
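
For reference, APERF and MPERF by themselves only yield the average
delivered frequency over a sampling interval, along the lines of the
sketch below (kernel-style C; the helper name and the base-frequency
scaling are illustrative assumptions, not code from any driver).
Nothing in the two counters says why the core ran at that frequency,
so they cannot distinguish a compute-bound thread from one stalled on
cache or memory.

#include <linux/math64.h>
#include <asm/msr.h>

static u64 prev_aperf, prev_mperf;

/*
 * Illustrative helper: average delivered frequency (kHz) since the
 * previous call, scaled from the guaranteed (base) frequency. The
 * ratio says how fast the core ran, but not what it was waiting on.
 */
static unsigned int avg_delivered_khz(unsigned int base_khz)
{
	u64 aperf, mperf, da, dm;

	rdmsrl(MSR_IA32_APERF, aperf);
	rdmsrl(MSR_IA32_MPERF, mperf);

	da = aperf - prev_aperf;
	dm = mperf - prev_mperf;
	prev_aperf = aperf;
	prev_mperf = mperf;

	if (!dm)
		return base_khz;

	/* delivered/guaranteed ratio scaled by the base frequency */
	return div64_u64((u64)base_khz * da, dm);
}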
>
> Yes, that's right, APERF and MPERF aren't sufficient to identify every
> kind of possible bottleneck; some visibility into the utilization of
> other subsystems is necessary in addition, like e.g. the
> instrumentation introduced in my series to detect a GPU bottleneck. A
> bottleneck condition in an IO device can be communicated to CPUFREQ

It generally is not sufficient to communicate it to cpufreq. It needs to be
communicated to the CPU scheduler.

> by adjusting a
> PM QoS latency request (link [2] in my previous reply) that effectively
> gives the governor permission to rearrange CPU work arbitrarily within
> the specified time frame (which should be of the order of the natural
> latency of the IO device -- e.g. at least the rendering time of a frame
> for a GPU) in order to minimize energy usage.
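
As a shape-only illustration of the kind of request being described,
this is roughly how a driver can publish a latency tolerance through
the mainline cpu_latency_qos_* interface (a minimal sketch: the driver
structure and the 10 ms budget are assumptions, and the "permission to
rearrange work" semantics above are specific to the series rather than
to this mainline API):

#include <linux/pm_qos.h>

/* Illustrative only: a driver-local request and a made-up 10 ms budget. */
static struct pm_qos_request io_latency_req;

#define IO_DEVICE_LATENCY_TOLERANCE_US	10000	/* 10 ms */

static int io_device_start(void)
{
	/*
	 * Publish how long the device can tolerate waiting; in the
	 * scheme described above, the governor could rearrange CPU
	 * work freely within this window to minimize energy usage.
	 */
	cpu_latency_qos_add_request(&io_latency_req,
				    IO_DEVICE_LATENCY_TOLERANCE_US);
	return 0;
}

static void io_device_stop(void)
{
	cpu_latency_qos_remove_request(&io_latency_req);
}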

OK, we need to talk more about this.

> > Or we characterize the workload at different P-states and set
> > limits. I think this is not what you want to say about energy
> > efficiency with your changes.
> >
> > The way you are trying to improve "performance" is by having the
> > caller (the device driver) say how important its job at hand is.
> > Here, suppose the device driver offloads this calculation to some
> > GPU and can wait up to 10 ms; you want to tell the CPU to be slow.
> > But the moment the P-state driver observes that there is a chance
> > of overshooting the latency, it will immediately ask for a higher
> > P-state. So you want P-state limits based on the latency
> > requirements of the caller. Since the caller has more knowledge of
> > the latency requirement, this allows other devices sharing the
> > power budget to get more or less power, and improves overall energy
> > efficiency as the combined performance of the system is improved.
> > Is this correct?
>
> Yes, pretty much.

OK
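
To make the policy summarized above concrete, the decision has roughly
this shape (a hypothetical sketch; none of these names exist in the
series or in mainline):

#include <linux/kernel.h>

static unsigned int pick_pstate(unsigned int util_pstate,
				unsigned int qos_max_pstate,
				bool overshoot_risk)
{
	/*
	 * While the caller's latency budget is comfortably met, cap
	 * the P-state at the QoS-derived limit even if utilization
	 * asks for more, trading speed for energy.
	 */
	if (!overshoot_risk)
		return min(util_pstate, qos_max_pstate);

	/*
	 * The moment the pending work risks missing the caller's
	 * latency budget, drop the cap and follow utilization.
	 */
	return util_pstate;
}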