[PATCH] intel_idle: Adjust the SKX C6 latency and residency if PC6 is disabled

From: Chen Yu
Date: Thu May 27 2021 - 00:52:17 EST


Currently cpuidle assumes worst-case C-state parameters, and so C6
is described with PC6 parameters, which is worst case for requesting
CC6. When PC6 is enabled, this is appropriate. But if PC6 is disabled
in BIOS, the exit latency and target_residency should be adjusted
accordingly.

Exit latency:
The C6 exit latency is measured when woken up from CC6/PC6. In the past,
if PC6 is disabled, CPU would be demoted to CC6/PC3, which is close to
the latency from CC6/PC6 and there is no need to update the C6 exit latency.
However on newer platform there is no CC3/PC3 anymore, then the C6 exit
latency with PC6 disabled should be CC6/PC0.

Target residency:
With PC6 disabled and C3/PC3 supported, the OS requests C3 if idle
duration is within [CC6, PC6) target_residency. On new CPU generations
with C3/PC3 deprecated, the OS would request C1E. This would cause
low energy-efficiency. In summary, the question is, should we lower
the bar to request C6 when PC6 is disabled? The answer is yes.

To fill this gap, check if PC6 is disabled in the BIOS in the
MSR_PKG_CST_CONFIG_CONTROL(0xe2). If so, use CC6/PC0 parameters as the
new exit latency. Meanwhile, update target_residency to 3 times of the new
exit latency. This is consistent with how intel_idle driver uses _CST to
calculate the target_residency. The consequence is that, the OS would
be more offen to choose C6 over C1E when PC6 is disabled. The new exit
latency of CC6/PC0 was from wult[1] result, which was measured via NIC
wakeup from 99.99th latency.

Before the change:
PC6 enabled Y N N
has C3/PC3 N N Y
idle duration [CC6,PC6) [CC6,PC6) [CC6,PC6)
CPU request C1E C1E C3

After the change:
PC6 enabled Y N N
has C3/PC3 N N Y
idle duration [CC6,PC6) [CC6,PC6) [CC6,PC6)
CPU request C1E *C6* C3

*C6* rather than C1E is chosen.

There is concern that if we should introduce a more generic solution
rather than optimizing on each platforms. However consider the
code complexity and different PC6 bit interpretation on different
platforms, tune the code per platform seems to be an acceptable trade-off.

[1] https://intel.github.io/wult/

Suggested-by: Len Brown <len.brown@xxxxxxxxx>
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
---
drivers/idle/intel_idle.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index ec1b9d306ba6..e6c543b5ee1d 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1484,6 +1484,36 @@ static void __init sklh_idle_state_table_update(void)
skl_cstates[6].flags |= CPUIDLE_FLAG_UNUSABLE; /* C9-SKL */
}

+/**
+ * skx_idle_state_table_update - Adjust the Sky Lake/Cascade Lake
+ * idle states table.
+ */
+static void __init skx_idle_state_table_update(void)
+{
+ unsigned long long msr;
+
+ rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr);
+
+ /*
+ * 000b: C0/C1 (no package C-state support)
+ * 001b: C2
+ * 010b: C6 (non-retention)
+ * 011b: C6 (retention)
+ * 111b: No Package C state limits.
+ */
+ if ((msr & 0x7) < 2) {
+ /*
+ * Uses the CC6 + PC0 latency and 3 times of
+ * latency for target_residency if the PC6
+ * is disabled in BIOS. This is consistent
+ * with how intel_idle driver uses _CST
+ * to set the target_residency.
+ */
+ skx_cstates[2].exit_latency = 92;
+ skx_cstates[2].target_residency = 276;
+ }
+}
+
static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)
{
unsigned int mwait_cstate = MWAIT_HINT2CSTATE(mwait_hint) + 1;
@@ -1515,6 +1545,9 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
case INTEL_FAM6_SKYLAKE:
sklh_idle_state_table_update();
break;
+ case INTEL_FAM6_SKYLAKE_X:
+ skx_idle_state_table_update();
+ break;
}

for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) {
--
2.25.1