Re: [PATCH v5 5/5] intel_idle: Add S0ix validation

From: Rafael J. Wysocki
Date: Mon Jul 10 2017 - 18:17:47 EST


On Monday, July 10, 2017 02:57:48 PM dbasehore . wrote:
> On Mon, Jul 10, 2017 at 6:33 AM, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
> > On Friday, July 07, 2017 05:03:03 PM Derek Basehore wrote:
> >> This adds validation of S0ix entry and enables it on Skylake. Using
> >> the new tick_set_freeze_event function, we program the CPU to wake up
> >> X seconds after entering freeze. After X seconds, it will wake the CPU
> >> to check the S0ix residency counters and make sure we entered the
> >> lowest power state for suspend-to-idle.
> >>
> >> It exits freeze and reports an error to userspace when the SoC does
> >> not enter S0ix on suspend-to-idle.
> >>
> >
> > Honestly, I'm totally unsure about this ATM, as it seems to assume that it
> > doesn't make senes to stay suspended if SLP_S0 residency is not there, but
> > that totally may not be the case.
> >
> > First of all, there are systems in which SLP_S0 is related to about 10%-20% of
> > the total power draw reduction whereas the remaining 80%-90% comes from PC10
> > alone. So, if you can get to PC10, being unable to get SLP_S0 residency on top
> > of that may not be a big deal after all. Of course, you may argue that 10%-20%
> > of battery life while suspended is "a lot", but that really depends on the
> > possible alternatives.
> >
>
> We'd have to track actual PC10 residency instead of checking if it's
> the requested state since the SoC can enter a higher power package
> cstate even if PC10 is requested.

That's correct, but it should be sufficient to check the PC10 residency
(there's some code to do that in turbostat, for example).

> I think this can be done by reading
> an msr register, though. Is there an example of how PC10 can be
> entered without SLP_S0 getting asserted by the way?

Yes, there is.

PC10 is a power state of the north complex and it can be entered regardless
of the SLP_S0 status which is related to the south complex.

> Also, this feature is disabled by default, so it doesn't prevent these
> use cases.
>
> > Second, as far as the alternatives go, it may not be rosy, because there are
> > systems that don't support S3 (or any other ACPI sleep states at all for that
> > matter) and suspend-to-idle is the only suspend mechanism available there.
> > On those systems it still may make sense to use it even though it may not
> > reduce the power draw that much. And from some experiments, suspend-to-idle
> > still extends battery life by 100% over runtime idle even if the system is not
> > able to get to PC10 most of the time.
>
> This is off by default.

Fair enough, but even so it may not be very useful in general as is.

> >
> > While I understand the use case, I don't think it is a binary "yes"-"no" thing
> > and the focus on just SLP_S0 may be misguided.
>
> Do you have a preference such as being able to set the level that you
> want to validate to? For instance, there could be an option to check
> that SLP_So is asserted, but there could also be an option to check
> for PC9 or PC10 residency. For instance, there could be a module
> parameters for setting the validated state:
>
> available_suspend_to_idle_states:
> "none pc6 pc9 pc10 slp_s0"
>
> max_suspend_to_idle_state:
> "none"
>
> Where the default validated state is none, but it can be set to any of
> the states in available_suspend_to_idle_states

In the suspend-to-idle path the driver will always request the deepest state
available (C10 for Skylake) and I would validate the associated package state
by default plus optionally SLP_S0.

> >
> >> One example of a bug that can prevent a Skylake CPU from entering S0ix
> >> (suspend-to-idle) is a leaked reference count to one of the i915 power
> >
> > Suspend-to-idle is not S0ix. Suspend-to-idle is the state the kernel puts the
> > system into after echoing "freeze" to /sys/power/state and S0ix is a platform
> > power state that may or may not be entered as a result of that.
> >
> >> wells. The CPU will not be able to enter Package C10 and will
> >> therefore use about 4x as much power for the entire system. The issue
> >> is not specific to the i915 power wells though.
> >
> > Well, fair enough, but what if the SoC can enter PC10, but without SLP_S0
> > residency on top of it?
> >
> >> Signed-off-by: Derek Basehore <dbasehore@xxxxxxxxxxxx>
> >> ---
> >> drivers/idle/intel_idle.c | 142 +++++++++++++++++++++++++++++++++++++++++++---
> >> 1 file changed, 134 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> >> index ebed3f804291..d38621da6e54 100644
> >> --- a/drivers/idle/intel_idle.c
> >> +++ b/drivers/idle/intel_idle.c
> >> @@ -61,10 +61,12 @@
> >> #include <linux/notifier.h>
> >> #include <linux/cpu.h>
> >> #include <linux/moduleparam.h>
> >> +#include <linux/suspend.h>
> >> #include <asm/cpu_device_id.h>
> >> #include <asm/intel-family.h>
> >> #include <asm/mwait.h>
> >> #include <asm/msr.h>
> >> +#include <asm/pmc_core.h>
> >>
> >> #define INTEL_IDLE_VERSION "0.4.1"
> >>
> >> @@ -93,12 +95,29 @@ struct idle_cpu {
> >> bool disable_promotion_to_c1e;
> >> };
> >>
> >> +/*
> >> + * The limit for the exponential backoff for the freeze duration. At this point,
> >> + * power impact is is far from measurable. It's about 3uW based on scaling from
> >> + * waking up 10 times a second.
> >> + */
> >> +#define MAX_SLP_S0_SECONDS 1000
> >> +#define SLP_S0_EXP_BASE 10
> >> +
> >> +static bool slp_s0_check;
> >> +static unsigned int slp_s0_seconds;
> >> +
> >> +static DEFINE_SPINLOCK(slp_s0_check_lock);
> >> +static unsigned int slp_s0_num_cpus;
> >> +static bool slp_s0_check_inprogress;
> >> +
> >> static const struct idle_cpu *icpu;
> >> static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;
> >> static int intel_idle(struct cpuidle_device *dev,
> >> struct cpuidle_driver *drv, int index);
> >> static int intel_idle_freeze(struct cpuidle_device *dev,
> >> struct cpuidle_driver *drv, int index);
> >> +static int intel_idle_freeze_and_check(struct cpuidle_device *dev,
> >> + struct cpuidle_driver *drv, int index);
> >> static struct cpuidle_state *cpuidle_state_table;
> >>
> >> /*
> >> @@ -597,7 +616,7 @@ static struct cpuidle_state skl_cstates[] = {
> >> .exit_latency = 2,
> >> .target_residency = 2,
> >> .enter = &intel_idle,
> >> - .enter_freeze = intel_idle_freeze, },
> >> + .enter_freeze = intel_idle_freeze_and_check, },
> >
> > Why do you do this for anything lower than C6?
>
> In that case, it's probably best the fail in these cases if the check
> is enabled. The CPU can't enter a lower cstate than the hinted one,
> correct?

Yes, it can, but this is based on the hint and not on the entered state. :-)

There is some gray area related to what if the user disabled the deepest state
via sysfs, but other than that the check only needs to be made in the deepest
state's callback (because that's what will be requested in the suspend-to-idle
path).

Thanks,
Rafael