Re: [PATCH v3]PM/Sleep: Timer quiesce in freeze state

From: Rafael J. Wysocki
Date: Thu Jan 29 2015 - 16:57:28 EST


On Monday, January 26, 2015 10:40:24 AM Thomas Gleixner wrote:
> On Mon, 26 Jan 2015, Li, Aubrey wrote:
> > On 2015/1/22 18:15, Thomas Gleixner wrote:
> > > Can we please stop adding more crap to that notifier thing? I rather
> > > see that go away than being expanded.
> >
> > Are you referring to FREEZE_PREPARE or remove all of FREEZE staff at all?
> >
> > What's the disadvantage of adding more notifier?
>
> clockevents_notify() is not a notifier. Its a multiplex call and I
> want to get rid of it and replace it with explicit functions.
>
> > >> + /*
> > >> + * cpuidle_enter will return with interrupt enabled
> > >> + */
> > >> + cpuidle_enter(drv, dev, next_state);
> > >
> > > How is that supposed to work?
> > >
> > > If timekeeping is not yet unfrozen, then any interrupt handling code
> > > which calls anything time related is going to hit lala land.
> > >
> > > You must guarantee that timekeeping is unfrozen before any interrupt
> > > is handled. If you cannot guarantee that, you cannot freeze
> > > timekeeping ever.
> > >
> > > The cpu local tick device is less critical, but it happens to work by
> > > chance, not by design.
> >
> > There are two way to guarantee this: the first way is, disable interrupt
> > before timekeeping frozen and enable interrupt after timekeeping is
> > unfrozen. However, we need to handle wakeup handler before unfreeze
> > timekeeping to wake freeze task up from wait queue.
> >
> > So we have to go the other way, the other way is, we ignore time related
> > calls during freeze, like what I added in irq_enter below.
>
> Groan. You just do not call in irq_enter/exit(), but what prevents any
> interrupt handler or whatever to call into the time/timer code after
> interrupts got reenabled?
>
> Nothing.
>
> > Or, we need to re-implement freeze wait and wake up mechanism?
>
> You need to make sure in the low level idle implementation that this
> cannot happen.
>
> tick_freeze()
> {
> raw_spin_lock(&tick_freeze_lock);
> tick_frozen++;
> if (tick_frozen == num_online_cpus())
> timekeeping_suspend();
> else
> tick_suspend_local();
> raw_spin_unlock(&tick_freeze_lock);
> }
>
> tick_unfreeze()
> {
> raw_spin_lock(&tick_freeze_lock);
> if (tick_frozen == num_online_cpus())
> timekeeping_resume();
> else
> tick_resume_local();
> tick_frozen--;
> raw_spin_unlock(&tick_freeze_lock);
> }
>
> idle_freeze()
> {
> local_irq_disable();
>
> tick_freeze();
>
> /* Must keep interrupts disabled! */
> go_deep_idle()
>
> tick_unfreeze();
>
> local_irq_enable();
> }
>
> That's the only way you can do it proper, everything else will just be
> a horrible mess of bandaids and duct tape.
>
> So that does not need any of the irq_enter/exit conditionals, it does
> not need the real_handler hack. It just works.
>
> The only remaining issue might be a NMI calling into
> ktime_get_mono_fast_ns() before timekeeping is resumed. Its probably a
> non issue on x86/tsc, but it might be a problem on other platforms
> which turn off devices, clocks, It's not rocket science to prevent
> that.

So the patch below is my version of the $subject one trying to take the
above suggestion into account. It doesn not address the ktime_get_mono_fast_ns()
in NMI case at this stage, because quite frankly I'd prefer to get the core
changes right in the first place.

The new ->enter_freeze callback is there in struct cpuidle_state, because we
generally cannot trust the existing ->enter callbacks to keep interrupts
disabled all the time. Some of them enable interrupts temporarily and
some idle entry methods enable interrupts automatically on exit. So
->enter_freeze is required to keep interrupts disabled and it is safe to
suspend/resume timekeeping around it.

The rest is reasonably straightforward. If suspend-to-idle is requested,
couidle_idle_call() calls cpuidle_enter_freeze() that bypasses the governor
and selects the deepest suitable state. If ->enter_freeze is available,
it will be used along with tick_freeze()/tick_unfreeze() and the normal
idle entry code path will be used otherwise.

I'm not quite sure if rcu_idle_enter/exit() needs to be done around that.
The reason why it isn't in the current patch is because timekeeping_resume()
was very unhappy about being part of the extended grace period and complained
about that very loudly.

Which of course means that the patch has been tested (lightly, with a hacked
ACPI cpuidle driver) and that's why it still uses tick_suspend_broadcast()
in tick_freeze().

Please let me know what you think.

Rafael


---
drivers/cpuidle/cpuidle.c | 83 +++++++++++++++++++++++++++++++---------------
include/linux/cpuidle.h | 8 +++-
include/linux/suspend.h | 2 +
include/linux/tick.h | 2 +
kernel/power/suspend.c | 37 ++++++++++++++++----
kernel/sched/idle.c | 9 ++++
kernel/time/tick-common.c | 50 +++++++++++++++++++++++++++
kernel/time/timekeeping.c | 4 +-
kernel/time/timekeeping.h | 2 +
9 files changed, 160 insertions(+), 37 deletions(-)

Index: linux-pm/kernel/power/suspend.c
===================================================================
--- linux-pm.orig/kernel/power/suspend.c
+++ linux-pm/kernel/power/suspend.c
@@ -37,7 +37,20 @@ const char *pm_states[PM_SUSPEND_MAX];
static const struct platform_suspend_ops *suspend_ops;
static const struct platform_freeze_ops *freeze_ops;
static DECLARE_WAIT_QUEUE_HEAD(suspend_freeze_wait_head);
-static bool suspend_freeze_wake;
+
+/* Suspend-to-idle state machnine. */
+enum freeze_state {
+ FREEZE_STATE_NONE, /* Not suspended/suspending. */
+ FREEZE_STATE_ENTER, /* Enter suspend-to-idle. */
+ FREEZE_STATE_WAKE, /* Wake up from suspend-to-idle. */
+};
+
+static enum freeze_state suspend_freeze_state;
+
+bool idle_should_freeze(void)
+{
+ return suspend_freeze_state == FREEZE_STATE_ENTER;
+}

void freeze_set_ops(const struct platform_freeze_ops *ops)
{
@@ -48,22 +61,32 @@ void freeze_set_ops(const struct platfor

static void freeze_begin(void)
{
- suspend_freeze_wake = false;
+ suspend_freeze_state = FREEZE_STATE_NONE;
}

static void freeze_enter(void)
{
- cpuidle_use_deepest_state(true);
+ suspend_freeze_state = FREEZE_STATE_ENTER;
+ get_online_cpus();
cpuidle_resume();
- wait_event(suspend_freeze_wait_head, suspend_freeze_wake);
+ /* Push all the CPUs into the idle loop. */
+ wake_up_all_idle_cpus();
+ pr_debug("PM: suspend-to-idle\n");
+ /* Make the current CPU wait so it can enter the idle loop too. */
+ wait_event(suspend_freeze_wait_head,
+ suspend_freeze_state == FREEZE_STATE_WAKE);
+ pr_debug("PM: resume from suspend-to-idle\n");
cpuidle_pause();
- cpuidle_use_deepest_state(false);
+ put_online_cpus();
+ suspend_freeze_state = FREEZE_STATE_NONE;
}

void freeze_wake(void)
{
- suspend_freeze_wake = true;
- wake_up(&suspend_freeze_wait_head);
+ if (suspend_freeze_state > FREEZE_STATE_NONE) {
+ suspend_freeze_state = FREEZE_STATE_WAKE;
+ wake_up(&suspend_freeze_wait_head);
+ }
}
EXPORT_SYMBOL_GPL(freeze_wake);

Index: linux-pm/include/linux/suspend.h
===================================================================
--- linux-pm.orig/include/linux/suspend.h
+++ linux-pm/include/linux/suspend.h
@@ -203,6 +203,7 @@ extern void suspend_set_ops(const struct
extern int suspend_valid_only_mem(suspend_state_t state);
extern void freeze_set_ops(const struct platform_freeze_ops *ops);
extern void freeze_wake(void);
+extern bool idle_should_freeze(void);

/**
* arch_suspend_disable_irqs - disable IRQs for suspend
@@ -230,6 +231,7 @@ static inline void suspend_set_ops(const
static inline int pm_suspend(suspend_state_t state) { return -ENOSYS; }
static inline void freeze_set_ops(const struct platform_freeze_ops *ops) {}
static inline void freeze_wake(void) {}
+static inline bool idle_should_freeze(void) { return false; }
#endif /* !CONFIG_SUSPEND */

/* struct pbe is used for creating lists of pages that should be restored
Index: linux-pm/kernel/sched/idle.c
===================================================================
--- linux-pm.orig/kernel/sched/idle.c
+++ linux-pm/kernel/sched/idle.c
@@ -7,6 +7,7 @@
#include <linux/tick.h>
#include <linux/mm.h>
#include <linux/stackprotector.h>
+#include <linux/suspend.h>

#include <asm/tlb.h>

@@ -96,6 +97,13 @@ static void cpuidle_idle_call(void)
*/
stop_critical_timings();

+ if (idle_should_freeze()) {
+ cpuidle_enter_freeze();
+ local_irq_enable();
+ start_critical_timings();
+ return;
+ }
+
/*
* Tell the RCU framework we are entering an idle section,
* so no more rcu read side critical sections and one more
@@ -220,6 +228,7 @@ static void cpu_idle_loop(void)
* know that the IPI is going to arrive right
* away
*/
+
if (cpu_idle_force_poll || tick_check_broadcast_expired())
cpu_idle_poll();
else
Index: linux-pm/drivers/cpuidle/cpuidle.c
===================================================================
--- linux-pm.orig/drivers/cpuidle/cpuidle.c
+++ linux-pm/drivers/cpuidle/cpuidle.c
@@ -19,6 +19,8 @@
#include <linux/ktime.h>
#include <linux/hrtimer.h>
#include <linux/module.h>
+#include <linux/suspend.h>
+#include <linux/tick.h>
#include <trace/events/power.h>

#include "cpuidle.h"
@@ -32,7 +34,6 @@ LIST_HEAD(cpuidle_detected_devices);
static int enabled_devices;
static int off __read_mostly;
static int initialized __read_mostly;
-static bool use_deepest_state __read_mostly;

int cpuidle_disabled(void)
{
@@ -66,36 +67,23 @@ int cpuidle_play_dead(void)
}

/**
- * cpuidle_use_deepest_state - Enable/disable the "deepest idle" mode.
- * @enable: Whether enable or disable the feature.
- *
- * If the "deepest idle" mode is enabled, cpuidle will ignore the governor and
- * always use the state with the greatest exit latency (out of the states that
- * are not disabled).
- *
- * This function can only be called after cpuidle_pause() to avoid races.
- */
-void cpuidle_use_deepest_state(bool enable)
-{
- use_deepest_state = enable;
-}
-
-/**
- * cpuidle_find_deepest_state - Find the state of the greatest exit latency.
- * @drv: cpuidle driver for a given CPU.
- * @dev: cpuidle device for a given CPU.
+ * cpuidle_find_deepest_state - Find deepest state meeting specific conditions.
+ * @drv: cpuidle driver for the given CPU.
+ * @dev: cpuidle device for the given CPU.
+ * @freeze: Whether or not the state should be suitable for suspend-to-idle.
*/
static int cpuidle_find_deepest_state(struct cpuidle_driver *drv,
- struct cpuidle_device *dev)
+ struct cpuidle_device *dev, bool freeze)
{
unsigned int latency_req = 0;
- int i, ret = CPUIDLE_DRIVER_STATE_START - 1;
+ int i, ret = freeze ? -1 : CPUIDLE_DRIVER_STATE_START - 1;

for (i = CPUIDLE_DRIVER_STATE_START; i < drv->state_count; i++) {
struct cpuidle_state *s = &drv->states[i];
struct cpuidle_state_usage *su = &dev->states_usage[i];

- if (s->disabled || su->disable || s->exit_latency <= latency_req)
+ if (s->disabled || su->disable || s->exit_latency <= latency_req
+ || (freeze && !s->enter_freeze))
continue;

latency_req = s->exit_latency;
@@ -104,6 +92,52 @@ static int cpuidle_find_deepest_state(st
return ret;
}

+static void enter_freeze_proper(struct cpuidle_driver *drv,
+ struct cpuidle_device *dev, int index)
+{
+ tick_freeze();
+ drv->states[index].enter_freeze(dev, drv, index);
+ tick_unfreeze();
+}
+
+/**
+ * cpuidle_enter_freeze - Enter an idle state suitable for suspend-to-idle.
+ *
+ * If there are states with the ->enter_freeze callback, find the deepest of
+ * them and enter it with frozen tick. Otherwise, find the deepest state
+ * available and enter it normally.
+ */
+void cpuidle_enter_freeze(void)
+{
+ struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
+ struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+ int index;
+
+ /*
+ * Find the deepest state with ->enter_freeze present, which guarantees
+ * that interrupts won't be enabled when it exits and allows the tick to
+ * be frozen safely.
+ */
+ index = cpuidle_find_deepest_state(drv, dev, true);
+ if (index >= 0) {
+ enter_freeze_proper(drv, dev, index);
+ return;
+ }
+
+ /*
+ * It is not safe to freeze the tick, find the deepest state available
+ * at all and try to enter it normally.
+ */
+ index = cpuidle_find_deepest_state(drv, dev, false);
+ if (index >= 0)
+ cpuidle_enter(drv, dev, index);
+ else
+ arch_cpu_idle();
+
+ /* Interrupts are enabled again here. */
+ local_irq_disable();
+}
+
/**
* cpuidle_enter_state - enter the state and update stats
* @dev: cpuidle device for this cpu
@@ -166,9 +200,6 @@ int cpuidle_select(struct cpuidle_driver
if (!drv || !dev || !dev->enabled)
return -EBUSY;

- if (unlikely(use_deepest_state))
- return cpuidle_find_deepest_state(drv, dev);
-
return cpuidle_curr_governor->select(drv, dev);
}

@@ -200,7 +231,7 @@ int cpuidle_enter(struct cpuidle_driver
*/
void cpuidle_reflect(struct cpuidle_device *dev, int index)
{
- if (cpuidle_curr_governor->reflect && !unlikely(use_deepest_state))
+ if (cpuidle_curr_governor->reflect)
cpuidle_curr_governor->reflect(dev, index);
}

Index: linux-pm/include/linux/cpuidle.h
===================================================================
--- linux-pm.orig/include/linux/cpuidle.h
+++ linux-pm/include/linux/cpuidle.h
@@ -50,6 +50,10 @@ struct cpuidle_state {
int index);

int (*enter_dead) (struct cpuidle_device *dev, int index);
+
+ void (*enter_freeze) (struct cpuidle_device *dev,
+ struct cpuidle_driver *drv,
+ int index);
};

/* Idle State Flags */
@@ -141,7 +145,7 @@ extern void cpuidle_resume(void);
extern int cpuidle_enable_device(struct cpuidle_device *dev);
extern void cpuidle_disable_device(struct cpuidle_device *dev);
extern int cpuidle_play_dead(void);
-extern void cpuidle_use_deepest_state(bool enable);
+extern void cpuidle_enter_freeze(void);

extern struct cpuidle_driver *cpuidle_get_cpu_driver(struct cpuidle_device *dev);
#else
@@ -174,7 +178,7 @@ static inline int cpuidle_enable_device(
{return -ENODEV; }
static inline void cpuidle_disable_device(struct cpuidle_device *dev) { }
static inline int cpuidle_play_dead(void) {return -ENODEV; }
-static inline void cpuidle_use_deepest_state(bool enable) {}
+static inline void cpuidle_enter_freeze(void) { }
static inline struct cpuidle_driver *cpuidle_get_cpu_driver(
struct cpuidle_device *dev) {return NULL; }
#endif
Index: linux-pm/kernel/time/tick-common.c
===================================================================
--- linux-pm.orig/kernel/time/tick-common.c
+++ linux-pm/kernel/time/tick-common.c
@@ -394,6 +394,56 @@ void tick_resume(void)
}
}

+static DEFINE_RAW_SPINLOCK(tick_freeze_lock);
+static unsigned int tick_freeze_depth;
+
+/**
+ * tick_freeze - Suspend the local tick and (possibly) timekeeping.
+ *
+ * Check if this is the last online CPU executing the function and if so,
+ * suspend timekeeping. Otherwise suspend the local tick.
+ *
+ * Call with interrupts disabled. Must be balanced with %tick_unfreeze().
+ * Interrupts must not be enabled before the subsequent %tick_unfreeze().
+ */
+void tick_freeze(void)
+{
+ raw_spin_lock(&tick_freeze_lock);
+
+ tick_freeze_depth++;
+ if (tick_freeze_depth == num_online_cpus()) {
+ timekeeping_suspend();
+ } else {
+ tick_suspend();
+ tick_suspend_broadcast();
+ }
+
+ raw_spin_unlock(&tick_freeze_lock);
+}
+
+/**
+ * tick_unfreeze - Resume the local tick and (possibly) timekeeping.
+ *
+ * Check if this is the first CPU executing the function and if so, resume
+ * timekeeping. Otherwise resume the local tick.
+ *
+ * Call with interrupts disabled. Must be balanced with %tick_freeze().
+ * Interrupts must not be enabled after the preceding %tick_freeze().
+ */
+void tick_unfreeze(void)
+{
+ raw_spin_lock(&tick_freeze_lock);
+
+ if (tick_freeze_depth == num_online_cpus())
+ timekeeping_resume();
+ else
+ tick_resume();
+
+ tick_freeze_depth--;
+
+ raw_spin_unlock(&tick_freeze_lock);
+}
+
/**
* tick_init - initialize the tick control
*/
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -226,5 +226,7 @@ static inline void tick_nohz_task_switch
__tick_nohz_task_switch(tsk);
}

+extern void tick_freeze(void);
+extern void tick_unfreeze(void);

#endif
Index: linux-pm/kernel/time/timekeeping.c
===================================================================
--- linux-pm.orig/kernel/time/timekeeping.c
+++ linux-pm/kernel/time/timekeeping.c
@@ -1170,7 +1170,7 @@ void timekeeping_inject_sleeptime64(stru
* xtime/wall_to_monotonic/jiffies/etc are
* still managed by arch specific suspend/resume code.
*/
-static void timekeeping_resume(void)
+void timekeeping_resume(void)
{
struct timekeeper *tk = &tk_core.timekeeper;
struct clocksource *clock = tk->tkr.clock;
@@ -1251,7 +1251,7 @@ static void timekeeping_resume(void)
hrtimers_resume();
}

-static int timekeeping_suspend(void)
+int timekeeping_suspend(void)
{
struct timekeeper *tk = &tk_core.timekeeper;
unsigned long flags;
Index: linux-pm/kernel/time/timekeeping.h
===================================================================
--- linux-pm.orig/kernel/time/timekeeping.h
+++ linux-pm/kernel/time/timekeeping.h
@@ -16,5 +16,7 @@ extern int timekeeping_inject_offset(str
extern s32 timekeeping_get_tai_offset(void);
extern void timekeeping_set_tai_offset(s32 tai_offset);
extern void timekeeping_clocktai(struct timespec *ts);
+extern int timekeeping_suspend(void);
+extern void timekeeping_resume(void);

#endif

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/