RE: [PATCH] hpwdt: Fix kdump issue in hpwdt

From: Toshi Kani
Date: Mon Aug 27 2012 - 16:50:25 EST


On Mon, 2012-08-27 at 19:57 +0000, Mingarelli, Thomas wrote:
> The main issue here is when an NMI comes in (which is hpwdt's main
> focus...to source NMIs and then panic the box) and the system is
> configured for kdump. We want the kdump to succeed and if the iLO
> watchdog timer is left alone to keep running, the kdump will not
> succeed. It will be interrupted by an ASR. This change ensures that
> the iLO Watchdog timer is always stopped in the booting case (of any
> kernel) or when an NMI arrives and we are in the process of taking a
> kdump.

Also, this change does not prevent running the watchdog daemon on the
crash kernel if we want to detect a hang condition there. The timer is
re-enabled when /dev/watchdog is opened. The change only ensures that
the timer stays disabled until the daemon starts up. A timer left
running on the crash kernel without the daemon is a problem because it
causes kdump to be interrupted.
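
For illustration, here is a minimal user-space sketch of what such a
daemon does, using the generic Linux watchdog interface
(<linux/watchdog.h>). It is not taken from the patch. The function name
wd_run() and its parameters are hypothetical, and whether the closing
'V' write actually stops the timer depends on the driver and its
nowayout setting.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the watchdog device at `path`, ping it
 * `pings` times with `interval_sec` seconds between pings, then try
 * to stop it cleanly. Returns 0 on success, -1 on any failure.
 */
int wd_run(const char *path, unsigned int interval_sec, int pings)
{
    int fd, i, rc = 0;

    assert(path != NULL);

    /* Opening the device is what arms the timer (per this thread). */
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    for (i = 0; i < pings; i++) {
        /* WDIOC_KEEPALIVE restarts the countdown. */
        if (ioctl(fd, WDIOC_KEEPALIVE, 0) < 0) {
            rc = -1;
            break;
        }
        sleep(interval_sec);
    }

    /*
     * "Magic close": writing 'V' before close() asks the driver to
     * stop the timer. It is skipped on the error path, so a failed
     * daemon leaves the timer running and the box gets reset.
     */
    if (rc == 0 && write(fd, "V", 1) != 1)
        rc = -1;
    if (close(fd) < 0)
        rc = -1;

    return rc;
}
```

A real daemon would call something like wd_run("/dev/watchdog", 10, n)
in a long-running loop. The interval has to stay below the configured
timeout, otherwise the timer fires between pings.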

Thanks,
-Toshi


>
> Tom
>
> -----Original Message-----
> From: Lars Marowsky-Bree [mailto:lmb@xxxxxxxx]
> Sent: Monday, August 27, 2012 2:22 PM
> To: Kani, Toshimitsu; wim@xxxxxxxxx; linux-watchdog@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx; Mingarelli, Thomas; stable@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] hpwdt: Fix kdump issue in hpwdt
>
> On 2012-08-27T12:52:24, Toshi Kani <toshi.kani@xxxxxx> wrote:
>
> > kdump can be interrupted by the watchdog timer when the timer is left
> > activated on the crash kernel. Changed the hpwdt driver to disable the
> > watchdog timer at boot time. This ensures that the watchdog timer is
> > disabled until /dev/watchdog is opened, and prevents the watchdog timer
> > from being left running on the crash kernel.
>
> How does this protect against the system hanging again in the crash
> kernel, or against hardware caches possibly flushing more data to
> shared storage?
>
> (I'm asking from the perspective of the hpwdt being used as a fencing
> mechanism in a cluster setting.)
>
> Or is the argument that it's "very unlikely" that a system in such a
> state would not make it far enough into the crash kernel?
>
>
> Regards,
> Lars
>

