Re: [PATCH v5 5/5] intel_idle: Add S0ix validation

From: dbasehore .
Date: Mon Jul 10 2017 - 17:57:59 EST


On Mon, Jul 10, 2017 at 6:33 AM, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
> On Friday, July 07, 2017 05:03:03 PM Derek Basehore wrote:
>> This adds validation of S0ix entry and enables it on Skylake. Using
>> the new tick_set_freeze_event function, we program the CPU to wake up
>> X seconds after entering freeze. After X seconds, it will wake the CPU
>> to check the S0ix residency counters and make sure we entered the
>> lowest power state for suspend-to-idle.
>>
>> It exits freeze and reports an error to userspace when the SoC does
>> not enter S0ix on suspend-to-idle.
>>
>
> Honestly, I'm totally unsure about this ATM, as it seems to assume that it
> doesn't make senes to stay suspended if SLP_S0 residency is not there, but
> that totally may not be the case.
>
> First of all, there are systems in which SLP_S0 is related to about 10%-20% of
> the total power draw reduction whereas the remaining 80%-90% comes from PC10
> alone. So, if you can get to PC10, being unable to get SLP_S0 residency on top
> of that may not be a big deal after all. Of course, you may argue that 10%-20%
> of battery life while suspended is "a lot", but that really depends on the
> possible alternatives.
>

We'd have to track actual PC10 residency instead of checking if it's
the requested state since the SoC can enter a higher power package
cstate even if PC10 is requested. I think this can be done by reading
an msr register, though. Is there an example of how PC10 can be
entered without SLP_S0 getting asserted by the way?

Also, this feature is disabled by default, so it doesn't prevent these
use cases.

> Second, as far as the alternatives go, it may not be rosy, because there are
> systems that don't support S3 (or any other ACPI sleep states at all for that
> matter) and suspend-to-idle is the only suspend mechanism available there.
> On those systems it still may make sense to use it even though it may not
> reduce the power draw that much. And from some experiments, suspend-to-idle
> still extends battery life by 100% over runtime idle even if the system is not
> able to get to PC10 most of the time.

This is off by default.

>
> While I understand the use case, I don't think it is a binary "yes"-"no" thing
> and the focus on just SLP_S0 may be misguided.

Do you have a preference such as being able to set the level that you
want to validate to? For instance, there could be an option to check
that SLP_So is asserted, but there could also be an option to check
for PC9 or PC10 residency. For instance, there could be a module
parameters for setting the validated state:

available_suspend_to_idle_states:
"none pc6 pc9 pc10 slp_s0"

max_suspend_to_idle_state:
"none"

Where the default validated state is none, but it can be set to any of
the states in available_suspend_to_idle_states

>
>> One example of a bug that can prevent a Skylake CPU from entering S0ix
>> (suspend-to-idle) is a leaked reference count to one of the i915 power
>
> Suspend-to-idle is not S0ix. Suspend-to-idle is the state the kernel puts the
> system into after echoing "freeze" to /sys/power/state and S0ix is a platform
> power state that may or may not be entered as a result of that.
>
>> wells. The CPU will not be able to enter Package C10 and will
>> therefore use about 4x as much power for the entire system. The issue
>> is not specific to the i915 power wells though.
>
> Well, fair enough, but what if the SoC can enter PC10, but without SLP_S0
> residency on top of it?
>
>> Signed-off-by: Derek Basehore <dbasehore@xxxxxxxxxxxx>
>> ---
>> drivers/idle/intel_idle.c | 142 +++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 134 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
>> index ebed3f804291..d38621da6e54 100644
>> --- a/drivers/idle/intel_idle.c
>> +++ b/drivers/idle/intel_idle.c
>> @@ -61,10 +61,12 @@
>> #include <linux/notifier.h>
>> #include <linux/cpu.h>
>> #include <linux/moduleparam.h>
>> +#include <linux/suspend.h>
>> #include <asm/cpu_device_id.h>
>> #include <asm/intel-family.h>
>> #include <asm/mwait.h>
>> #include <asm/msr.h>
>> +#include <asm/pmc_core.h>
>>
>> #define INTEL_IDLE_VERSION "0.4.1"
>>
>> @@ -93,12 +95,29 @@ struct idle_cpu {
>> bool disable_promotion_to_c1e;
>> };
>>
>> +/*
>> + * The limit for the exponential backoff for the freeze duration. At this point,
>> + * power impact is is far from measurable. It's about 3uW based on scaling from
>> + * waking up 10 times a second.
>> + */
>> +#define MAX_SLP_S0_SECONDS 1000
>> +#define SLP_S0_EXP_BASE 10
>> +
>> +static bool slp_s0_check;
>> +static unsigned int slp_s0_seconds;
>> +
>> +static DEFINE_SPINLOCK(slp_s0_check_lock);
>> +static unsigned int slp_s0_num_cpus;
>> +static bool slp_s0_check_inprogress;
>> +
>> static const struct idle_cpu *icpu;
>> static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;
>> static int intel_idle(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv, int index);
>> static int intel_idle_freeze(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv, int index);
>> +static int intel_idle_freeze_and_check(struct cpuidle_device *dev,
>> + struct cpuidle_driver *drv, int index);
>> static struct cpuidle_state *cpuidle_state_table;
>>
>> /*
>> @@ -597,7 +616,7 @@ static struct cpuidle_state skl_cstates[] = {
>> .exit_latency = 2,
>> .target_residency = 2,
>> .enter = &intel_idle,
>> - .enter_freeze = intel_idle_freeze, },
>> + .enter_freeze = intel_idle_freeze_and_check, },
>
> Why do you do this for anything lower than C6?

In that case, it's probably best the fail in these cases if the check
is enabled. The CPU can't enter a lower cstate than the hinted one,
correct?

>
>> {
>> .name = "C1E",
>> .desc = "MWAIT 0x01",
>> @@ -605,7 +624,7 @@ static struct cpuidle_state skl_cstates[] = {
>> .exit_latency = 10,
>> .target_residency = 20,
>> .enter = &intel_idle,
>> - .enter_freeze = intel_idle_freeze, },
>> + .enter_freeze = intel_idle_freeze_and_check, },
>> {
>> .name = "C3",
>> .desc = "MWAIT 0x10",
>> @@ -613,7 +632,7 @@ static struct cpuidle_state skl_cstates[] = {
>> .exit_latency = 70,
>> .target_residency = 100,
>> .enter = &intel_idle,
>> - .enter_freeze = intel_idle_freeze, },
>> + .enter_freeze = intel_idle_freeze_and_check, },
>> {
>> .name = "C6",
>> .desc = "MWAIT 0x20",
>> @@ -621,7 +640,7 @@ static struct cpuidle_state skl_cstates[] = {
>> .exit_latency = 85,
>> .target_residency = 200,
>> .enter = &intel_idle,
>> - .enter_freeze = intel_idle_freeze, },
>> + .enter_freeze = intel_idle_freeze_and_check, },
>> {
>> .name = "C7s",
>> .desc = "MWAIT 0x33",
>> @@ -629,7 +648,7 @@ static struct cpuidle_state skl_cstates[] = {
>> .exit_latency = 124,
>> .target_residency = 800,
>> .enter = &intel_idle,
>> - .enter_freeze = intel_idle_freeze, },
>> + .enter_freeze = intel_idle_freeze_and_check, },
>> {
>> .name = "C8",
>> .desc = "MWAIT 0x40",
>> @@ -637,7 +656,7 @@ static struct cpuidle_state skl_cstates[] = {
>> .exit_latency = 200,
>> .target_residency = 800,
>> .enter = &intel_idle,
>> - .enter_freeze = intel_idle_freeze, },
>> + .enter_freeze = intel_idle_freeze_and_check, },
>> {
>> .name = "C9",
>> .desc = "MWAIT 0x50",
>> @@ -645,7 +664,7 @@ static struct cpuidle_state skl_cstates[] = {
>> .exit_latency = 480,
>> .target_residency = 5000,
>> .enter = &intel_idle,
>> - .enter_freeze = intel_idle_freeze, },
>> + .enter_freeze = intel_idle_freeze_and_check, },
>> {
>> .name = "C10",
>> .desc = "MWAIT 0x60",
>> @@ -653,7 +672,7 @@ static struct cpuidle_state skl_cstates[] = {
>> .exit_latency = 890,
>> .target_residency = 5000,
>> .enter = &intel_idle,
>> - .enter_freeze = intel_idle_freeze, },
>> + .enter_freeze = intel_idle_freeze_and_check, },
>> {
>> .enter = NULL }
>> };
>> @@ -940,6 +959,8 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
>> * @dev: cpuidle_device
>> * @drv: cpuidle driver
>> * @index: state index
>> + *
>> + * @return 0 for success, no failure state
>> */
>> static int intel_idle_freeze(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv, int index)
>> @@ -952,6 +973,101 @@ static int intel_idle_freeze(struct cpuidle_device *dev,
>> return 0;
>> }
>>
>> +static int check_slp_s0(u32 slp_s0_saved_count)
>> +{
>> + u32 slp_s0_new_count;
>> +
>> + if (intel_pmc_slp_s0_counter_read(&slp_s0_new_count)) {
>> + pr_warn("Unable to read SLP S0 residency counter\n");
>> + return -EIO;
>> + }
>> +
>> + if (slp_s0_saved_count == slp_s0_new_count) {
>> + pr_warn("CPU did not enter SLP S0 for suspend-to-idle.\n");
>> + return -EIO;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * intel_idle_freeze_and_check - enters suspend-to-idle and validates the power
>> + * state
>> + *
>> + * This function enters suspend-to-idle with intel_idle_freeze, but also sets up
>> + * a timer to check that S0ix (low power state for suspend-to-idle on Intel
>> + * CPUs) is properly entered.
>> + *
>> + * @dev: cpuidle_device
>> + * @drv: cpuidle_driver
>> + * @index: state index
>> + * @return 0 for success, -EERROR if S0ix was not entered.
>> + */
>> +static int intel_idle_freeze_and_check(struct cpuidle_device *dev,
>> + struct cpuidle_driver *drv, int index)
>> +{
>> + bool check_on_this_cpu = false;
>> + u32 slp_s0_saved_count;
>> + unsigned long flags;
>> + int cpu = smp_processor_id();
>> + int ret;
>> +
>> + /* The last CPU to freeze sets up checking SLP S0 assertion. */
>> + spin_lock_irqsave(&slp_s0_check_lock, flags);
>> + slp_s0_num_cpus++;
>> + if (slp_s0_seconds &&
>> + slp_s0_num_cpus == num_online_cpus() &&
>> + !slp_s0_check_inprogress &&
>> + !intel_pmc_slp_s0_counter_read(&slp_s0_saved_count)) {
>> + ret = tick_set_freeze_event(cpu, ktime_set(slp_s0_seconds, 0));
>> + if (ret < 0) {
>> + spin_unlock_irqrestore(&slp_s0_check_lock, flags);
>> + goto out;
>> + }
>> +
>> + /*
>> + * Make sure check_slp_s0 isn't scheduled on another CPU if it
>> + * were to leave freeze and enter it again before this CPU
>> + * leaves freeze.
>> + */
>> + slp_s0_check_inprogress = true;
>> + check_on_this_cpu = true;
>> + }
>> + spin_unlock_irqrestore(&slp_s0_check_lock, flags);
>> +
>> + ret = intel_idle_freeze(dev, drv, index);
>> + if (ret < 0)
>> + goto out;
>> +
>> + if (check_on_this_cpu && tick_clear_freeze_event(cpu))
>> + ret = check_slp_s0(slp_s0_saved_count);
>> +
>> +out:
>> + spin_lock_irqsave(&slp_s0_check_lock, flags);
>> + if (check_on_this_cpu) {
>> + slp_s0_check_inprogress = false;
>> + slp_s0_seconds = min_t(unsigned int,
>> + SLP_S0_EXP_BASE * slp_s0_seconds,
>> + MAX_SLP_S0_SECONDS);
>> + }
>> + slp_s0_num_cpus--;
>> + spin_unlock_irqrestore(&slp_s0_check_lock, flags);
>> + return ret;
>> +}
>> +
>> +static int slp_s0_check_prepare(struct notifier_block *nb, unsigned long action,
>> + void *data)
>> +{
>> + if (action == PM_SUSPEND_PREPARE)
>> + slp_s0_seconds = slp_s0_check ? 1 : 0;
>> +
>> + return NOTIFY_DONE;
>> +}
>> +
>> +static struct notifier_block intel_slp_s0_check_nb = {
>> + .notifier_call = slp_s0_check_prepare,
>> +};
>> +
>> static void __setup_broadcast_timer(bool on)
>> {
>> if (on)
>> @@ -1454,6 +1570,13 @@ static int __init intel_idle_init(void)
>> goto init_driver_fail;
>> }
>>
>> + retval = register_pm_notifier(&intel_slp_s0_check_nb);
>> + if (retval) {
>> + free_percpu(intel_idle_cpuidle_devices);
>> + cpuidle_unregister_driver(&intel_idle_driver);
>> + goto pm_nb_fail;
>> + }
>> +
>> if (boot_cpu_has(X86_FEATURE_ARAT)) /* Always Reliable APIC Timer */
>> lapic_timer_reliable_states = LAPIC_TIMER_ALWAYS_RELIABLE;
>>
>> @@ -1469,6 +1592,8 @@ static int __init intel_idle_init(void)
>>
>> hp_setup_fail:
>> intel_idle_cpuidle_devices_uninit();
>> + unregister_pm_notifier(&intel_slp_s0_check_nb);
>> +pm_nb_fail:
>> cpuidle_unregister_driver(&intel_idle_driver);
>> init_driver_fail:
>> free_percpu(intel_idle_cpuidle_devices);
>> @@ -1484,3 +1609,4 @@ device_initcall(intel_idle_init);
>> * is the easiest way (currently) to continue doing that.
>> */
>> module_param(max_cstate, int, 0444);
>> +module_param(slp_s0_check, bool, 0644);
>
> This has to be documented somehow.
>
> Also, if it is not set, there is a useless overhead every time
> intel_idle_freeze_and_check() is called. It looks like you could use
> a static key or similar to avoid that.
>
> Moreover, the notifier is not necessary then as well.
>
> Thanks,
> Rafael
>