Re: [RESEND][PATCH v3] watchdog: Fix disable/enable regression

From: BjÃrn Mork
Date: Wed Dec 19 2012 - 16:17:52 EST


Don Zickus <dzickus@xxxxxxxxxx> writes:
> On Wed, Dec 19, 2012 at 08:51:31PM +0100, BjÃrn Mork wrote:
>> commit 8d451690 ("watchdog: Fix CPU hotplug regression") cause
>> an oops or hard lockup when doing
>>
>> echo 0 > /proc/sys/kernel/nmi_watchdog
>> echo 1 > /proc/sys/kernel/nmi_watchdog
>>
>> and the kernel is booted with nmi_watchdog=1 (default)
>>
>> Running laptop-mode-tools and disconnecting/connecting AC power
>> will cause this to trigger, making it a common failure scenario
>> on laptops.
>>
>> Instead of bailing out of watchdog_disable() when !watchdog_enabled
>> we can initialize the hrtimer regardless of watchdog_enabled status.
>> This makes it safe to call watchdog_disable() in the nmi_watchdog=0
>> case, without the negative effect on the enabled => disabled =>
>> enabled case.
>>
>> All these tests pass with this patch:
>> - nmi_watchdog=1
>> echo 0 > /proc/sys/kernel/nmi_watchdog
>> echo 1 > /proc/sys/kernel/nmi_watchdog
>>
>> - nmi_watchdog=0
>> echo 0 > /sys/devices/system/cpu/cpu1/online
>>
>> - nmi_watchdog=0
>> echo mem > /sys/power/state
>
> What about the opposite cases?
> nmi_watchdog=1
> echo 1 > /sys/devices/system/cpu/cpu1/online

I don't see why not. But verifying it would be nice. I thought that it
would be a simple thing to test using qemu-kvm, but it seems that the
CPU hotplugging support there isn't quite ready. The guest just dies
with "Assertion `bus->allow_hotplug' failed."

I'll go digging for alternatives, but if anyone else could verify this
then I'd appreciate it.

> Are we going to leak memory by re-initing hrtimer? Or is that caught
> somewhere?

There are no allocations in hrtimer_init() AFAICS? It just initializes
the preallocated hrtimer struct.

> Other than that, the patch seems reasonable to me. Basically just forcing
> the init of hrtimer so there is something to cancel on the disable side.
> Then again, I don't fully understand the original problem.

I believe the original problem was calling hrtimer_cancel on an
uninitialized hrtimer. That is likely to blow up here:

static inline
void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
{
raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
}


Note that the lock_hrtimer_base() function protects access to
timer->base with a NULL test. That should probably be done in unlock as
well, for symmetry? But this is another issue and not at all urgent.




BjÃrn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/