Re: [PATCH 2/8] watchdog: Introduce hardware maximum timeout in watchdog core

From: Guenter Roeck
Date: Wed Aug 05 2015 - 05:14:30 EST


On 08/05/2015 01:22 AM, Uwe Kleine-König wrote:
Hello Guenter,

On Tue, Aug 04, 2015 at 09:03:27AM -0700, Guenter Roeck wrote:
On 08/04/2015 08:52 AM, Uwe Kleine-König wrote:
On Tue, Aug 04, 2015 at 08:31:43AM -0700, Guenter Roeck wrote:
On 08/04/2015 05:18 AM, Uwe Kleine-König wrote:
On Mon, Aug 03, 2015 at 07:13:28PM -0700, Guenter Roeck wrote:
structure. If the configured timeout exceeds half the value of the
maximum hardware timeout, the watchdog core enables a timer function
to assist sending keepalive requests to the watchdog driver.
I don't understand why you want to halve the maximum hw-timeout. If my
watchdog has hw-max-timeout = 5s and userspace sets it to 3s there
should be no need for assistance?! I think the implementation is the
other way round?

It is supposed to reflect the _maximum_ timeout. That is different to
the time between heartbeats, which is supposed to be less; using half
the value of the maximum hardware timeout seemed to be a safe number.
Right, I got that. With hw-max-timeout = 5s the machine resets after 5s
not caring for the device. And so pinging repeatedly after 2.5s is fine.
But if userspace sets a timeout of 3s (probably with the intention to
ping with a frequency of 1/1.5s) there is no need for worker-assistance,
because the pings coming in each 1.5s provided by userspace are good
enough.

Yes, that is how it is supposed to work.
So for the changelog you want:

If the configured timeout exceeds the maximum hardware timeout
the watchdog core enables a timer function ...

right?

Something like that. You are right, the changelog needs an update.

+static inline bool watchdog_need_worker(struct watchdog_device *wdd)
+{
+ unsigned int hm = wdd->max_hw_timeout_ms;
+ unsigned int m = wdd->max_timeout * 1000;
+
+ return watchdog_active(wdd) && hm && hm != m &&
+ wdd->timeout * 500 > hm;
One problem with the worker I see is that the reset will probably be
delayed with your worker. Consider userspace sets timeout = 10 s because
if the main application doesn't work for 12 s something dangerous can
happen. (Consider a guillotine where the blade can only be hold up for
12 s when not locked. :-) Now if the hw-max-timeout is 9s you setup a
timer to ping at $last_keepalive + 4.5 s and $last_keepalive + 9 s (not
taking timer and system latency into account). That means the system
only resets 18 s after the last userspace ping. Oops.

So ideally you send the last auto-ping at $last_keepalive +
$configured_timeout - $hw-max-timeout (assuming the hardware is
configured for $hw-max-timeout).


Yes, you are right, and makes sense. In practice that means I would
schedule an auto-ping 1s after the most recent user ping in your
example above, not at fixed intervals of 4.5s. I'll look into it.
Thinking about it, it is better anyway to reschedule the auto-ping
after a user ping, to avoid unnecessary pings.

I don't understand what max_timeout is now that there is max_hw_timeout.
So I don't understand why you need hm != m either.


Backward compatibility. A driver which does not set max_hw_timeout_ms,
or sets both to the same value, by definition expects to handle everything
internally, and thus no worker is configured.
And a driver that does

max_timeout = 5
max_hw_timeout = 5125

falls through the cracks.

Hmm - not that this configuration makes any sense, but you are right.
I'll make it "hm < m".
It does not? What do you expect max_timeout to be set to if the maximal
hw-timeout is 5125 ms? 0 would work, but IMHO you need some more
documentation then.


0 would work, or anything larger than 5.125s.

If you want to set max_timeout to 0, there is no need to set max_hw_timeout.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/