Re: [PATCH 2/2] watchdog: Add a mechanism to detect stalls on guest vCPUs

From: Guenter Roeck
Date: Wed Apr 06 2022 - 14:12:56 EST

On 4/6/22 09:31, Sebastian Ene wrote:
On Tue, Apr 05, 2022 at 02:15:51PM -0700, Guenter Roeck wrote:

Hello Guenter,

On Tue, Apr 05, 2022 at 02:19:55PM +0000, Sebastian Ene wrote:
This patch adds support for a virtual watchdog which relies on the
per-cpu hrtimers to pet at regular intervals.

The watchdog subsystem is not intended to detect soft and hard lockups.
It is intended to detect userspace issues. A watchdog driver requires
a userspace compinent which needs to ping the watchdog on a regular basis
to prevent timeouts (and watchdog drivers are supposed to use the
watchdog kernel API).

Thanks for getting back ! I wanted to create a mechanism to detect
stalls on vCPUs and I am not sure if the current watchdog subsystem has a way
to create per-CPU binded watchdogs (in the same way as Power PC has
The per-CPU watchdog is needed to account for time that the guest is not
running(either scheduled out or waiting for an event) to prevent spurious
reset events caused by the watchdog.

What you have here is a CPU stall detection mechanism, similar to the
existing soft/hard lockup detection mechanism. This code does not
belong into the watchdog subsystem; it is similar to the existing
hard/softlockup detection code (kernel/watchdog.c) and should reside
at the same location.

I agree that this doesn't belong to the watchdog subsytem but the current
stall detection mechanism calls through MMIO into a virtual device
'qemu,virt-watchdog'. Calling a device from (kernel/watchdog.c) isn't
something that we should avoid ?

You are introducing qemu,virt-watchdog, so it seems to me that any argument
along that line doesn't really apply.

I think it is more a matter for core kernel developers to discuss and
decide how this functionality is best instantiated. It doesn't _have_
to be a device, after all, just like the current lockup detection
code is not a device. Either case, I am not really the right person
to discuss this since it is a matter of core kernel code which I am
not sufficiently familiar with. All I can say is that watchdog drivers
in the watchdog subsystem have a different scope.


Having said that, I could imagine a watchdog driver to be used in VMs,
but that would be similar to existing watchdog drivers. The easiest way
to get there would probably be to just instantiate one of the watchdog
devices already supported by qemu.

I am looking forward for your response,