Re: [PATCH] watchdog: fix for lockup detector breakage on resume

From: Sameer Nanda
Date: Fri Apr 27 2012 - 17:40:21 EST


On Fri, Apr 27, 2012 at 2:12 PM, Andrew Morton
<akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, 27 Apr 2012 11:10:40 -0700
> Sameer Nanda <snanda@xxxxxxxxxxxx> wrote:
>
>> On the suspend/resume path the boot CPU does not go through an
>> offline->online transition.  This breaks the NMI detector
>> post-resume since it depends on PMU state that is lost when
>> the system gets suspended.
>>
>> Fix this by forcing a CPU offline->online transition for the
>> lockup detector on the boot CPU during resume.
>>
>> Signed-off-by: Sameer Nanda <snanda@xxxxxxxxxxxx>
>> ---
>> To provide more context, we enable NMI watchdog on
>> Chrome OS.  We have seen several reports of systems freezing
>> up completely which indicated that the NMI watchdog was not
>> firing for some reason.
>>
>> Debugging further, we found a simple way of repro'ing system
>> freezes -- issuing the command 'taskset 1 sh -c "echo nmilockup > /proc/breakme"'
>> after the system has been suspended/resumed one or more times.
>>
>> With this patch in place, the system freezes result in panics,
>> as expected.  These panics provide a nice stack trace for us
>> to debug the actual issue causing the freeze.
>>
>> ...
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
>>                                  size_t *lenp, loff_t *ppos);
>>  extern unsigned int  softlockup_panic;
>>  void lockup_detector_init(void);
>> +void lockup_detector_bootcpu_resume(void);
>>  #else
>>  static inline void touch_softlockup_watchdog(void)
>>  {
>> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
>>  static inline void lockup_detector_init(void)
>>  {
>>  }
>> +static inline void lockup_detector_bootcpu_resume(void)
>> +{
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_DETECT_HUNG_TASK
>> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
>> index 396d262..0d262a8 100644
>> --- a/kernel/power/suspend.c
>> +++ b/kernel/power/suspend.c
>> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
>>       arch_suspend_enable_irqs();
>>       BUG_ON(irqs_disabled());
>>
>> +     /* Kick the lockup detector */
>> +     lockup_detector_bootcpu_resume();
>> +
>>   Enable_cpus:
>>       enable_nonboot_cpus();
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index df30ee0..dd2ac93 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
>>       .notifier_call = cpu_callback
>>  };
>>
>> +void lockup_detector_bootcpu_resume(void)
>> +{
>> +     void *cpu = (void *)(long)smp_processor_id();
>> +
>> +     /*
>> +      * On the suspend/resume path the boot CPU does not go though the
>> +      * offline->online transition. This breaks the NMI detector post
>> +      * resume. Force an offline->online transition for the boot CPU on
>> +      * resume.
>> +      */
>> +     cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>> +     cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>> +
>> +     return;
>> +}
>
> I have issues with the comment ;) It describes some old bug which isn't
> there any more and which nobody cares about.  A better comment would
> simply describe the function in the usual fashion.  Something like
> this:
>
> --- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix
> +++ a/kernel/watchdog.c
> @@ -597,20 +597,17 @@ static struct notifier_block __cpuinitda
>         .notifier_call = cpu_callback
>  };
>
> +/*
> + * On entry to suspend we force an offline->online transition on the boot CPU so
> + * that PMU state is available to that CPU when it comes back online after
> + * resume.  This information is required for restarting the NMI watchdog.

This call actually happens during "exit from suspend" (or "entry into
resume") processing, so how about something like:

On exit from suspend we force an offline->online transition on the boot CPU so
that the PMU state that was lost while in suspended state gets set up properly
for the boot CPU. This information is required for restarting the NMI watchdog.

> + */
>  void lockup_detector_bootcpu_resume(void)
>  {
>         void *cpu = (void *)(long)smp_processor_id();
>
> -       /*
> -        * On the suspend/resume path the boot CPU does not go though the
> -        * offline->online transition. This breaks the NMI detector post
> -        * resume. Force an offline->online transition for the boot CPU on
> -        * resume.
> -        */
>         cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>         cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> -
> -       return;
>  }
>
>  void __init lockup_detector_init(void)
> _
>
>
> But I'm not sure how accurate it is.  Is it true that the PMU data was
> required for starting the NMI hardware?
>
>
> Also, this is all dead code if CONFIG_SUSPEND=n, so how about
>
> --- a/include/linux/sched.h~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
> +++ a/include/linux/sched.h
> @@ -317,7 +317,6 @@ extern int proc_dowatchdog_thresh(struct
>                                   size_t *lenp, loff_t *ppos);
>  extern unsigned int  softlockup_panic;
>  void lockup_detector_init(void);
> -void lockup_detector_bootcpu_resume(void);
>  #else
>  static inline void touch_softlockup_watchdog(void)
>  {
> @@ -331,6 +330,11 @@ static inline void touch_all_softlockup_
>  static inline void lockup_detector_init(void)
>  {
>  }
> +#endif
> +
> +#if defined(CONFIG_LOCKUP_DETECTOR) && defined(CONFIG_SUSPEND)
> +void lockup_detector_bootcpu_resume(void);
> +#else
>  static inline void lockup_detector_bootcpu_resume(void)
>  {
>  }
> --- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
> +++ a/kernel/watchdog.c
> @@ -597,6 +597,7 @@ static struct notifier_block __cpuinitda
>         .notifier_call = cpu_callback
>  };
>
> +#ifdef CONFIG_SUSPEND
>  /*
>   * On entry to suspend we force an offline->online transition on the boot CPU so
>   * that PMU state is available to that CPU when it comes back online after
> @@ -609,6 +610,7 @@ void lockup_detector_bootcpu_resume(void
>         cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>         cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>  }
> +#endif
>
>  void __init lockup_detector_init(void)
>  {
> _
>
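
To make the failure mode discussed in this thread concrete, here is a
minimal userspace sketch in C.  It is not kernel code: every sim_* name
is a hypothetical stand-in, and it simply assumes that the ONLINE
notification is what reprograms the PMU for the NMI watchdog, which is
the premise of the patch rather than something this sketch proves.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the CPU_DEAD / CPU_ONLINE hotplug actions. */
enum sim_action { SIM_CPU_DEAD, SIM_CPU_ONLINE };

#define SIM_NR_CPUS 4

/* Simulated per-CPU watchdog state (assumption, not the kernel's layout). */
struct sim_cpu {
	bool pmu_programmed;	/* hardware state, lost across suspend */
	bool watchdog_armed;	/* what the software believes */
};

static struct sim_cpu sim_cpus[SIM_NR_CPUS];

/* Models the notifier: DEAD tears the watchdog down, ONLINE sets it up. */
int sim_cpu_callback(enum sim_action action, int cpu)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);

	switch (action) {
	case SIM_CPU_DEAD:
		sim_cpus[cpu].watchdog_armed = false;
		sim_cpus[cpu].pmu_programmed = false;
		break;
	case SIM_CPU_ONLINE:
		sim_cpus[cpu].pmu_programmed = true;
		sim_cpus[cpu].watchdog_armed = true;
		break;
	}
	return 0;
}

/* Suspend wipes the hardware state but leaves the software flag alone. */
void sim_suspend(void)
{
	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		sim_cpus[cpu].pmu_programmed = false;
}

/* The NMI can only fire if software and hardware state agree. */
bool sim_nmi_can_fire(int cpu)
{
	return sim_cpus[cpu].watchdog_armed && sim_cpus[cpu].pmu_programmed;
}

/* Mirrors the patch: force offline->online on the boot CPU at resume. */
void sim_bootcpu_resume(int cpu)
{
	sim_cpu_callback(SIM_CPU_DEAD, cpu);
	sim_cpu_callback(SIM_CPU_ONLINE, cpu);
}
```

In this model, bringing CPU 0 online and then calling sim_suspend()
leaves sim_nmi_can_fire(0) false until sim_bootcpu_resume(0) runs,
which corresponds to the silent freezes seen without the patch.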



--
Sameer