Re: [PATCH] intel_pstate: track and export frequency residency stats via sysfs.

From: Anup Chenthamarakshan
Date: Wed Sep 10 2014 - 18:15:19 EST


On Wed, Sep 10, 2014 at 09:39:30AM -0700, Dirk Brandewie wrote:
> On 09/09/2014 04:22 PM, Anup Chenthamarakshan wrote:
> >On Tue, Sep 09, 2014 at 08:15:13AM -0700, Dirk Brandewie wrote:
> >>On 09/08/2014 05:10 PM, Anup Chenthamarakshan wrote:
> >>>Exported stats appear in
> >>><sysfs>/devices/system/cpu/intel_pstate/time_in_state as follows:
> >>>
> >>>## CPU 0
> >>>400000 3647
> >>>500000 24342
> >>>600000 144150
> >>>700000 202469
> >>>## CPU 1
> >>>400000 4813
> >>>500000 22628
> >>>600000 149564
> >>>700000 211885
> >>>800000 173890
> >>>
> >>>Signed-off-by: Anup Chenthamarakshan <anupc@xxxxxxxxxxxx>
> >>
> >>What is this information being used for?
> >
> >I'm using P-state residency information in power consumption tests to
> >calculate the proportion of time spent in each P-state across all processors
> >(one global set of percentages, one per P-state). This is used to validate
> >new changes from the power perspective. Essentially, these are sanity checks
> >to flag changes with a large difference in P-state residency.
> >
> >So far, we've been using the data exported by acpi-cpufreq to track this.
> >
> >>
> >>Tracking the current P state request for each core is only part of the
> >>story. The processor aggregates the requests from all cores and then decides
> >>what frequency the package will run at; this evaluation happens on a ~1ms
> >>time frame. If a core is idle then it loses its vote for what the package
> >>frequency will be, and its frequency will be zero even though it may have
> >>been requesting a high P state when it went idle. Tracking the residency of
> >>the requested P state doesn't provide much useful information other than
> >>ensuring that the requests are changing over time IMHO.
> >
> >This is exactly why we're trying to track it.
>
> My point is that you are tracking the residency of the request and not
> the P state the package was running at. On a lightly loaded system
> it is not unusual for a core that was very busy and requesting a high
> P state to go idle for several seconds. In this case that core would
> lose its vote for the package P state but the stats would show that
> the P state was high for a very long time when its real frequency
> was zero.

I see what you're saying. Requesting a p-state does not necessarily mean that is
the state the CPU is in.
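
For reference, a minimal sketch of how the global per-P-state percentages
could be derived from the time_in_state layout proposed in the patch
("## CPU n" headers followed by "<frequency> <time>" pairs). The function
name is hypothetical, and the unit of the second column is treated as
opaque since only ratios matter. As discussed above, the result describes
the residency of *requests*, not of the frequency the package actually ran at.

```python
from collections import defaultdict


def global_residency_percent(text):
    """Sum time per frequency across all CPUs and return percentages.

    `text` is assumed to follow the time_in_state layout from the patch:
    lines starting with '#' are per-CPU headers, every other non-empty
    line is '<frequency> <time>'.
    """
    totals = defaultdict(int)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '## CPU n' headers
        freq, time_spent = line.split()
        totals[int(freq)] += int(time_spent)

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {freq: 100.0 * t / grand_total for freq, t in sorted(totals.items())}
```

Comparing the dictionaries from two test runs gives the kind of sanity
check described earlier in the thread.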

>
> There are a couple of ways to get what I consider better information
> about what is actually going on.
>
> The current turbostat provides C state residency and calculates the
> average/effective frequency of the core over its sample time.
> Turbostat will also measure the power consumption from the CPU point
> of view if your processor supports the RAPL registers.
>
> Reading MSR 0x198 MSR_IA32_PERF_STATUS will tell you what the core
> would run at if it were not idle; this reflects the decision that the
> package made based on current requests.
>
> Using perf to collect power:pstate_sample event will give information
> about each sample on the core and give you timestamps to detect idle
> times.
>
> Using perf to collect power:cpu_frequency will show when the P state
> request was changed on each core and is triggered by intel_pstate and
> acpi_cpufreq.
>
> Powertop collects the same information as turbostat and a bunch of
> other information useful in seeing where you could be burning power
> for no good reason.
>
> For getting an idea of real power turbostat is the easiest to use and
> is available on most systems. Using perf will give you a very fine grained
> view of what is going on as well as point to the culprit for bad
> behaviour in most cases.

Tools like powertop and turbostat are not present by default on all systems,
so it is not always possible to use them :(

Would it make sense to expose the current (64-bit) values of aperf and mperf
through sysfs? This would let userspace tools calculate the average frequency
of a CPU over a long period of time. For example, a load test that runs for
1 hour would only need to poll sysfs twice (per CPU) to do this calculation,
instead of polling MSRs on each CPU once every second or so (to account for
overruns).
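
A rough sketch of the userspace calculation this would enable, assuming the
two counter samples have already been read by some means (no such sysfs file
exists today, so the inputs are plain integers here). The function names and
the base_khz parameter are hypothetical; base_khz stands for the fixed
reference frequency at which mperf counts. Both counters only advance while
the core is not idle, so the result is the average frequency while running,
not over wall-clock time.

```python
MASK64 = (1 << 64) - 1


def counter_delta(start, end):
    """Difference between two 64-bit counter samples, tolerating one wrap."""
    return (end - start) & MASK64


def average_frequency_khz(aperf_start, mperf_start, aperf_end, mperf_end,
                          base_khz):
    """Average non-idle frequency between two (aperf, mperf) samples.

    base_khz is an assumed input: the reference frequency mperf counts at.
    """
    d_aperf = counter_delta(aperf_start, aperf_end)
    d_mperf = counter_delta(mperf_start, mperf_end)
    if d_mperf == 0:
        return 0.0  # core never left idle between the two samples
    return base_khz * d_aperf / d_mperf
```

With samples taken at the start and end of a test, one call per CPU is
enough; the masked subtraction is what removes the need for frequent polling
to catch counter wraparound.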

>
> >
> >>
> >>This interface will not be supportable with upcoming processors using
> >>hardware P states as documented in volume 3 of the current SDM Section 14.4
> >>http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
> >>The OS will have no way of knowing what the P state requests for a
> >>given core are.
> >
> >Will there be any means to determine the proportion of time spent in different
> >HWP-states when HWP gets enabled (maybe at a package level)?
> >
> Not that I am aware of :-( There is MSR_PPERF (section 14.4.5.1) that will
> give the CPU's view of the amount of productive work/scalability of the
> current load.
>
> --Dirk