Re: [PATCH v9 1/3] watchdog: introduce watchdog.open_timeout commandline parameter

From: Rasmus Villemoes
Date: Tue Jan 29 2019 - 15:37:00 EST


On 22/01/2019 18.29, Guenter Roeck wrote:
> On Mon, Jan 21, 2019 at 08:45:39PM +0000, Rasmus Villemoes wrote:
>> The watchdog framework takes care of feeding a hardware watchdog until
>> userspace opens /dev/watchdogN. If that never happens for some reason
>> (buggy init script, corrupt root filesystem or whatnot) but the kernel
>> itself is fine, the machine stays up indefinitely. This patch allows
>> setting an upper limit for how long the kernel will take care of the
>> watchdog, thus ensuring that the watchdog will eventually reset the
>> machine.
>>
>> A value of 0 (the default) means infinite timeout, preserving the
>> current behaviour.
>>
>> This is particularly useful for embedded devices where some fallback
>> logic is implemented in the bootloader (e.g., use a different root
>> partition, boot from network, ...).
>>
>> There is already handle_boot_enabled serving a similar purpose. However,
>> such a binary choice is unsuitable if the hardware watchdog cannot be
>> programmed by the bootloader to provide a timeout long enough for
>> userspace to get up and running. Many of the embedded devices we see use
>> external (gpio-triggered) watchdogs with a fixed timeout of the order of
>> 1-2 seconds.
>>
>> The open timeout is also used as a maximum time for an application to
>> re-open /dev/watchdogN after closing it. Again, while the kernel already
>> has a nowayout mechanism, using that means userspace is at the mercy of
>> whatever timeout the hardware has.
>>
>> Being a module parameter, one can revert to the ordinary behaviour of
>> having the kernel maintain the watchdog indefinitely by simply writing 0
>> to /sys/... after initially opening /dev/watchdog; conversely, one can
>> of course also have the current behaviour of allowing indefinite time
>> until the first open, and then set that module parameter.
>>
>> Signed-off-by: Rasmus Villemoes <rasmus.villemoes@xxxxxxxxx>
>> ---
>> .../watchdog/watchdog-parameters.txt | 8 +++++
>> drivers/watchdog/watchdog_dev.c | 30 +++++++++++++++++--
>> 2 files changed, 36 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/watchdog/watchdog-parameters.txt b/Documentation/watchdog/watchdog-parameters.txt
>> index 0b88e333f9e1..907c4bb13810 100644
>> --- a/Documentation/watchdog/watchdog-parameters.txt
>> +++ b/Documentation/watchdog/watchdog-parameters.txt
>> @@ -8,6 +8,14 @@ See Documentation/admin-guide/kernel-parameters.rst for information on
>> providing kernel parameters for builtin drivers versus loadable
>> modules.
>>
>> +The watchdog core parameter watchdog.open_timeout is the maximum time,
>> +in seconds, for which the watchdog framework will take care of pinging
>> +a hardware watchdog until userspace opens the corresponding
>> +/dev/watchdogN device. A value of 0 (the default) means an infinite
>> +timeout. Setting this to a non-zero value can be useful to ensure that
>> +either userspace comes up properly, or the board gets reset and allows
>> +fallback logic in the bootloader to try something else.
>> +
>
> This is misleading. Unless I am missing something, the above only applies
> if the watchdog is already runnning at boot,

well, yes, if it's not running at boot, there's nothing for the kernel
to take care of. I can do s/a hardware watchdog/a running hardware
watchdog/ if that makes it clearer.

and after it has been opened
> and closed once.
>
> FWIW, I find this operation quite confusing. What is the rationale for not
> starting the watchdog at boot time if it is not running and open_timeout
> is set, but then refusing to stop it after it has been started once ?

I guess you have a point that there's a certain asymmetry there. In
practice, the cases where one wants this kind of robustness guarantee do
not rely on a watchdog that can/must be started and stopped by software.

>>
>> static void watchdog_ping_work(struct kthread_work *work)
>> @@ -297,7 +317,7 @@ static int watchdog_stop(struct watchdog_device *wdd)
>> return -EBUSY;
>> }
>>
>> - if (wdd->ops->stop) {
>> + if (wdd->ops->stop && !open_timeout) {
>
> This changes the semantics of WDIOC_SETOPTIONS / WDIOS_DISABLECARD.
> "Turn off the watchdog timer" is well defined and doesn't leave
> the option of setting a timeout on it.

I can drop this hunk, since it's mostly irrelevant to the actual use
cases I have in mind. It makes testing the feature on reference boards a
little more awkward, but I can live with that.

>> +
>> +module_param(open_timeout, uint, 0644);
>> +MODULE_PARM_DESC(open_timeout,
>> + "Maximum time (in seconds, 0 means infinity) for userspace to take over a running watchdog (default=0)");
>
> The description is misleading. After the initial open, a subsequent close no
> longer really stops the watchdog.

If I drop the "&& !open_timeout" above, this would be accurate, no?

Rasmus