Re: [RFC 0/3] Experimental patchset for CPPC
From: Ashwin Chaugule
Date: Thu Aug 14 2014 - 16:11:31 EST
+ Rafael [corrected email addr]
On 14 August 2014 15:57, Ashwin Chaugule <ashwin.chaugule@xxxxxxxxxx> wrote:
>
> Hello,
>
> Apologies in advance for a lengthy cover letter. Hopefully it has all the
> required information so you don't need to read the ACPI spec. ;)
>
> This patchset introduces the ideas behind CPPC (Collaborative Processor
> Performance Control) and implements support for controlling CPU performance
> using the existing PID (Proportional-Integral-Derivative) controller (from
> intel_pstate.c) and some CPPC semantics.
>
> This patchset is not a final proposal for the CPPC implementation. I've had
> to hack some sections due to lack of hardware; details are in the Testing
> section.
>
> There are several bits of information which are needed in order to make CPPC
> work great on Linux based platforms and I'm hoping to start a wider discussion on
> how to address the missing bits. The following sections briefly introduce CPPC
> and later highlight the information which is missing.
>
> More importantly, I'm also looking for ideas on how to support CPPC in the short
> term, given that we will soon be seeing products based on ARM64 and X86 which
> support CPPC.[1] Although we may not have all the information, we could make it
> work with existing governors in a way this patchset demonstrates. Hopefully,
> this approach is acceptable for mainline inclusion in the short term.
>
> Finer details about the CPPC spec are available in the latest ACPI 5.1
> specification.[2]
>
> If these issues are being discussed on some other thread or elsewhere, or if
> someone is already working on it, please let me know. Also, please correct me if
> I have misunderstood anything.
>
> What is CPPC:
> =============
>
> CPPC is the new interface for CPU performance control between the OS and the
> platform defined in ACPI 5.0+. The interface is built on an abstract
> representation of CPU performance rather than raw frequency. Basic operation
> consists of:
>
> * Platform enumerates supported performance range to OS
>
> * OS requests desired performance level over some time window along
> with min and max instantaneous limits
>
> * Platform is free to optimize power/performance within bounds provided by OS
>
> * Platform provides telemetry back to OS on delivered performance
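>
> The cycle above can be sketched roughly as follows. This is a hypothetical
> illustration, not the patchset's API; every name and value below is made up,
> and the platform side is stubbed out:

```c
#include <stdint.h>

/* Performance capabilities the platform enumerates to the OS. */
struct cppc_perf_caps {
	uint32_t highest;		/* HighestPerformance */
	uint32_t nominal;		/* NominalPerformance */
	uint32_t lowest_nonlinear;	/* LowestNonlinearPerformance */
	uint32_t lowest;		/* LowestPerformance */
};

/* What the OS writes back: desired level plus instantaneous limits. */
struct cppc_perf_request {
	uint32_t desired;
	uint32_t min;
	uint32_t max;
	uint32_t time_window_us;
};

/* Stub for what a real platform would return over the PCC channel. */
static void platform_read_caps(struct cppc_perf_caps *caps)
{
	caps->highest = 300;
	caps->nominal = 250;
	caps->lowest_nonlinear = 120;
	caps->lowest = 100;
}

/* Clamp the OS request to the enumerated range before "writing" it. */
static void cppc_set_perf(const struct cppc_perf_caps *caps,
			  struct cppc_perf_request *req)
{
	if (req->desired > caps->highest)
		req->desired = caps->highest;
	if (req->desired < caps->lowest)
		req->desired = caps->lowest;
	/* A real driver would now write DesiredPerformanceRegister
	 * and ring the PCC doorbell. */
}
```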
>
> Communication with the OS is abstracted via another ACPI construct called
> Platform Communication Channel (PCC) which is essentially a generic shared
> memory channel with doorbell interrupts going back and forth. This abstraction
> allows the "platform" for CPPC to be a variety of different entities - driver,
> firmware, BMC, etc.
>
> CPPC describes the following registers:
>
> * HighestPerformance: (read from platform)
>
> Indicates the highest level of performance the processor is theoretically
> capable of achieving, given ideal operating conditions.
>
> * NominalPerformance: (read from platform)
>
> Indicates the highest sustained performance level of the processor. This is the
> highest operating performance level the CPU is expected to deliver continuously.
>
> * LowestNonlinearPerformance: (read from platform)
>
> Indicates the lowest performance level of the processor with non-linear power
> savings.
>
> * LowestPerformance: (read from platform)
>
> Indicates the lowest performance level of the processor.
>
> * GuaranteedPerformanceRegister: (read from platform)
>
> Optional. If supported, contains the register to read the current guaranteed
> performance from. This is the current maximum sustained performance of the CPU,
> taking into account all budgeting constraints. It can change at runtime, and
> changes are notified to the OS via ACPI notification mechanisms.
>
> * DesiredPerformanceRegister: (write to platform)
>
> Register into which the OS writes its desired performance level.
>
> * MinimumPerformanceRegister: (write to platform)
>
> Optional. This is the min allowable performance as requested by the OS.
>
> * MaximumPerformanceRegister: (write to platform)
>
> Optional. This is the max allowable performance as requested by the OS.
>
> * PerformanceReductionToleranceRegister (write to platform)
>
> Optional. This is the allowable deviation below the desired perf value, as
> requested by the OS. If the Time Window Register (below) is supported, then this
> value is the minimum average performance over the time window that the OS
> desires.
>
> * TimeWindowRegister: (write to platform)
>
> Optional. The OS requests desired performance over this time window.
>
> * CounterWraparoundTime: (read from platform)
>
> Optional. Minimum time before the performance counters wrap around.
>
> * ReferencePerformanceCounterRegister: (read from platform)
>
> A counter that increments proportionally to the reference performance of the
> processor.
>
> * DeliveredPerformanceCounterRegister: (read from platform)
>
> Delivered perf = reference perf * delta(delivered perf ctr)/delta(ref perf ctr)
>
> * PerformanceLimitedRegister: (read from platform)
>
> This is set by the platform in the event that it has to limit available
> performance due to thermal or budgeting constraints.
>
> * CPPCEnableRegister: (read/write)
>
> Enables or disables CPPC.
>
> * AutonomousSelectionEnable:
>
> Platform decides CPU performance level w/o OS assist.
>
> * AutonomousActivityWindowRegister:
>
> This influences the rate of increase or decrease in CPU performance under the
> platform's autonomous selection policy.
>
> * EnergyPerformancePreferenceRegister:
>
> Provides an energy or performance bias hint to the platform when in autonomous mode.
>
> * ReferencePerformance: (read from platform)
>
> Indicates the rate at which the reference counter increments.
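>
> To make the counter arithmetic concrete, here is the delivered-performance
> formula from the DeliveredPerformanceCounterRegister entry above as code (the
> function name and signature are mine, not from the spec or the patchset):

```c
#include <stdint.h>

/* Delivered perf = reference perf * delta(delivered ctr) / delta(ref ctr).
 * The deltas come from two successive reads of the reference and delivered
 * performance counters. */
static uint64_t cppc_delivered_perf(uint64_t reference_perf,
				    uint64_t delivered_ctr_delta,
				    uint64_t reference_ctr_delta)
{
	if (reference_ctr_delta == 0)
		return 0; /* no reference ticks elapsed; nothing to report */
	return reference_perf * delivered_ctr_delta / reference_ctr_delta;
}
```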
>
>
> What's missing in CPPC:
> =======================
>
> Currently CPPC makes no mention of power. However, this could be added in future
> versions of the spec.
> e.g. although CPPC works off of a continuous range of CPU perf levels, we could
> discretize the scale such that we only extract points where the power level changes
> substantially between CPU perf levels and export this information to the
> scheduler.
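>
> A minimal sketch of that discretization idea (everything here is hypothetical;
> neither the spec nor this patchset defines such an interface):

```c
#include <stdint.h>

/* Given (perf, power) samples over the continuous range, keep only the
 * points where power has risen by at least power_step_mw since the last
 * kept point. Input is assumed sorted by increasing perf level. */
struct perf_power_point {
	uint32_t perf;
	uint32_t power_mw;
};

static int discretize_perf_scale(const struct perf_power_point *in, int n,
				 uint32_t power_step_mw,
				 struct perf_power_point *out)
{
	int i, kept = 0;

	for (i = 0; i < n; i++) {
		if (kept == 0 ||
		    in[i].power_mw >= out[kept - 1].power_mw + power_step_mw)
			out[kept++] = in[i];
	}
	return kept;	/* number of points worth exporting */
}
```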
>
> What's missing in the kernel:
> =============================
>
> We may have some of this information in the scheduler, but I couldn't see a good way
> to extract it for CPPC yet.
>
> (1) An intelligent way to provide a min/max bound and a desired value for CPU
> performance.
>
> (2) A timing window for the platform to deliver requested performance within
> bounds. This could be a kind of sampling interval between consecutive reads of
> delivered cpu performance.
>
> (3) Centralized decision making by any CPU in a freq domain for all its
> siblings.
>
> The last point needs some elaboration:
>
> I see that the CPUfreq layer allows defining "related CPUs" and that we can have
> the same policy for CPUs in the same freq domain and one governor per policy.
> However, from what I could tell, there are at least two baked-in assumptions
> in this layer which break things, at least for platforms like ARM (please
> correct me if I'm wrong!):
>
> (a) All CPUs run at the exact same max, min and cur freq.
>
> (b) Any CPU always gets exactly the freq it asked for.
>
> So, although the CPUFreq layer is capable of making somewhat centralized
> cpufreq decisions for CPUs under the same policy, it seems to be deciding
> things under wrong or inapplicable assumptions. Moreover, only one CPU is in
> charge of policy handling at a time, and policy handling shifts to another CPU
> in the domain only if the former CPU is hotplugged out.
>
> Not having a proper centralized decision maker adversely affects power saving
> possibilities in platforms that can't distinguish when a CPU requests a specific
> freq and then goes to sleep. This potentially has the effect of keeping other
> CPUs in the domain running at a much higher frequency than required, while the
> initial requester is deep asleep.
>
> So, for point (3), I'm not sure which path we should take among the following:
>
> (I) Fix the cpufreq layer and add CPPC support as a cpufreq_driver.
> (a) Change every call that gets the freq to read h/w registers and then snap
> the value back to the freq table. This way, cpufreq can keep its idea of freq
> current. However, this may end up waking CPUs to read counters, unless they
> are mem mapped.
> (b) Allow any CPU in the "related_cpus" mask to make policy decisions on
> behalf of its siblings, so that policy-maker switching is not tied to hotplug.
>
> (II) Leave CPUfreq untouched and use the PID algorithm instead, but change
> the busyness calculation to accumulate busyness values from all CPUs in a
> common domain. This requires implementing domain awareness.
>
> (III) Address these issues in the upcoming CPUfreq/CPUidle integration layer(?)
>
> (IV) Handle it in the platform or lose out. I understand this has some potential
> for adding latency to cpu freq requests so it may not be possible for all
> platforms.
>
> (V) ..?
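>
> For what it's worth, the busyness accumulation in option (II) could look
> something like this. Treating the domain as busy as its busiest sibling is
> just one possible policy; the names and the choice of max are mine:

```c
#include <stdint.h>

/* Fold per-CPU busyness into one domain-wide value before running the
 * PID. Using the max means a sibling that requested a high perf level
 * and then went to sleep no longer pins the whole domain: its stale
 * request stops mattering as soon as its measured busyness drops. */
static uint32_t domain_busyness(const uint32_t *cpu_busy, int ncpus)
{
	uint32_t busiest = 0;
	int i;

	for (i = 0; i < ncpus; i++)
		if (cpu_busy[i] > busiest)
			busiest = cpu_busy[i];
	return busiest;
}
```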
>
> For points (1) and (2), the long-term solution IMHO is to work it out along
> with the scheduler/CPUfreq/CPUidle integration. But it's not clear to me what
> would be the best short-term approach. I'd greatly appreciate any
> suggestions/comments.
> If anyone is already working on these issues, please CC me as well.
>
> Test setup:
> ==========
>
> For the sake of experiments, I used the Thinkpad x240 laptop, which advertises
> CPPC tables in its ACPI firmware. The PCC and CPPC drivers included in this
> patchset are able to parse the tables and get all the required addresses.
> However, it seems that this laptop doesn't implement PCC doorbell and the
> firmware side of CPPC. The PCC doorbell calls would just wait forever. Not sure
> what's going on there. So, I had to hack it and emulate, to some extent, what
> the platform would've done.
>
> I extracted the PID algo from intel_pstate.c and modified it with CPPC function
> wrappers. It shouldn't be hard to replace PID with anything else we think is
> suitable. In the long term, I hope we can make CPPC calls directly from the
> scheduler.
>
> There are two versions of the low level CPPC accessors. The one included in the
> patchset is how I'd imagine it would work with platforms that completely
> implement CPPC in firmware.
>
> The other version is here [5]. This should help with DT-based platforms,
> platforms with broken firmware, enablement purposes, etc.
>
> I ran a simple kernel compilation with intel_pstate.c and the CPPC-modified
> version as the governors and saw no real difference in compile times, so no
> new overhead was added. I verified that CPU freq requests were honored by
> reading out the PERF_STATUS register.
>
> [1] - See the HWP section 14.4 http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf
> [2] - http://www.uefi.org/sites/default/files/resources/ACPI_5_1release.pdf
> [3] - https://plus.google.com/+TheodoreTso/posts/2vEekAsG2QT
> [4] - https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL
> [5] - http://git.linaro.org/people/ashwin.chaugule/leg-kernel.git/blob/236d901d31fb06fda798880c9ca09d65123c5dd9:/drivers/cpufreq/cppc_x86.c
>
> Ashwin Chaugule (3):
> ACPI: Add support for Platform Communication Channel
> CPPC: Add support for Collaborative Processor Performance Control
> CPPC: Add ACPI accessors to CPC registers
>
> drivers/acpi/Kconfig | 10 +
> drivers/acpi/Makefile | 1 +
> drivers/acpi/pcc.c | 301 +++++++++++++++
> drivers/cpufreq/Kconfig | 19 +
> drivers/cpufreq/Makefile | 2 +
> drivers/cpufreq/cppc.c | 874 ++++++++++++++++++++++++++++++++++++++++++++
> drivers/cpufreq/cppc.h | 181 +++++++++
> drivers/cpufreq/cppc_acpi.c | 80 ++++
> 8 files changed, 1468 insertions(+)
> create mode 100644 drivers/acpi/pcc.c
> create mode 100644 drivers/cpufreq/cppc.c
> create mode 100644 drivers/cpufreq/cppc.h
> create mode 100644 drivers/cpufreq/cppc_acpi.c
>
> --
> 1.9.1
>