Re: [PATCH 1/2] watchdog: iTCO_wdt: optionally leave watchdog enabled during restart

From: Bjorn Helgaas
Date: Tue Jul 12 2011 - 13:23:00 EST


On Thu, Jul 7, 2011 at 9:53 AM, Pádraig Brady <P@xxxxxxxxxxxxxx> wrote:
> On 07/07/11 16:39, Valdis.Kletnieks@xxxxxx wrote:
>> On Wed, 06 Jul 2011 10:09:36 MDT, Bjorn Helgaas said:
>>
>>> If we reboot via BIOS, BIOS should disable the watchdog itself, so this
>>> shouldn't cause unintended resets, even if the user interrupts the boot.
>>
>> Yes, but didn't Linus say something about BIOS code authors being
>> crack-addicted monkeys? :)

I shouldn't have written anything about what BIOS "should" do. That's
not very useful because, as you suggest, there is room for variation
there.

The risk I was alluding to was this:
- User boots with "reboot_timeout=X"
- User reboots normally (non-kexec)
- BIOS does some reinitialization
- Machine doesn't autoboot, e.g., because user interrupted boot
- Watchdog resets machine -- this may be unexpected by the user

On the machines I tested, the unexpected reset doesn't happen because
the BIOS reinit includes disabling the watchdog. But obviously, that
depends on BIOS details, so there's no guarantee.

I should have just written something along the lines of:

    The reboot_timeout option is intended for kexec reboots, which do
    not involve BIOS. In this case, the reboot_timeout covers the
    interval between shutdown of the watchdog driver in the old kernel
    and startup of the driver in the new kernel.

    For normal reboots (via the BIOS), the behavior depends on the BIOS
    implementation. Some BIOSes disable the watchdog timer, so the
    reboot_timeout only covers the interval until the BIOS disable.
    Others leave the timer running, so the reboot_timeout may cause a
    reset if the machine doesn't autoboot, e.g., if the user interrupts
    the boot.
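
To make the cases above concrete, here is a rough model in Python. It is
only an illustration, not kernel or driver code: the function name and
its three boolean parameters are made up for this sketch, and the BIOS
case assumes the BIOS reaches its disable step before the timeout
expires.

```python
def watchdog_resets_machine(kexec: bool,
                            bios_disables_watchdog: bool,
                            driver_restarts_in_time: bool) -> bool:
    """Return True if the still-armed watchdog is expected to reset the machine.

    Hypothetical model of the cases described above, not real driver logic.
    """
    if kexec:
        # kexec does not involve BIOS, so only the new kernel's watchdog
        # driver can take over the timer before it expires.
        return not driver_restarts_in_time
    if bios_disables_watchdog:
        # Assumption: the BIOS disables the timer before the timeout runs out,
        # so the timeout only covers the interval up to that point.
        return False
    # BIOS leaves the timer running: if the machine does not autoboot
    # (e.g. the user interrupts the boot), the watchdog fires.
    return not driver_restarts_in_time
```

For example, watchdog_resets_machine(False, False, False) models a normal
reboot on a BIOS that leaves the timer running, with the boot interrupted
by the user; the model reports a reset, which is the surprise described
above.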

I think the *option* of using a reboot_timeout is still useful,
especially in clusters of unattended machines where it's expensive to
deal with boot failures.

> Yes as I said in a round about way in another mail,
> one can't depend on that at all.
> Some reset, some don't, some behave weirdly,
> iTCO is unusual as kernel resets early at boot, ...

You mentioned an unexplained iTCO reset in your other mail. That
sounds like a kernel or iTCO_wdt bug, but I think it's unrelated to
this patch.

> If using this, one would have to set the timeout large enough,
> to encompass a full reboot

Right. In the case of iTCO, I think the range is up to about 10
minutes, which is enough in my case (things like fsck may take longer,
but that's OK as long as the watchdog driver is built in statically).
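
As a rough sanity check for picking a value, something like the sketch
below could be used. The names are hypothetical, and the 600-second
ceiling is only my "about 10 minutes" estimate from above, not an exact
hardware limit. The gap is measured from shutdown of the old driver to
startup of the new one, so with a built-in driver a later fsck does not
count toward it.

```python
# Assumption: "about 10 minutes" as stated above, not an exact iTCO limit.
ASSUMED_MAX_TIMEOUT_S = 600


def timeout_is_usable(timeout_s: int, driver_gap_s: int,
                      max_timeout_s: int = ASSUMED_MAX_TIMEOUT_S) -> bool:
    """Check that a timeout outlasts the gap between driver shutdown and
    driver startup while staying within the assumed hardware range."""
    return driver_gap_s < timeout_s <= max_timeout_s
```

With an estimated gap of 120 seconds, timeout_is_usable(300, 120) holds,
while 60 seconds would be too short and 900 seconds would exceed the
assumed range.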

Bjorn