Re: [PATCH V3] workqueue/watchdog: Make unbound workqueues aware of

From: Petr Mladek
Date: Mon Mar 29 2021 - 06:14:52 EST


On Wed 2021-03-24 19:34:02, Wang Qing wrote:
> There are two workqueue-specific watchdog timestamps:
>
> + @wq_watchdog_touched_cpu (per-CPU) updated by
> touch_softlockup_watchdog()
>
> + @wq_watchdog_touched (global) updated by
> touch_all_softlockup_watchdogs()
>
> watchdog_timer_fn() checks only the global @wq_watchdog_touched for
> unbound workqueues. As a result, unbound workqueues are not aware
> of touch_softlockup_watchdog(). The watchdog might report a stall
> even when the unbound workqueues are blocked by known slow code.
>
> Solution:
> touch_softlockup_watchdog() must also touch the global @wq_watchdog_touched
> timestamp.
>
> The global timestamp can no longer be used for bound workqueues
> because it is updated on all CPUs. Instead, bound workqueues
> have to check only @wq_watchdog_touched_cpu and this timestamp
> has to be updated for all CPUs in touch_all_softlockup_watchdogs().
>
> Beware:
> The change might cause the opposite problem. An unbound workqueue
> might get blocked on CPU A because of a real softlockup. The workqueue
> watchdog would miss it when the timestamp got touched on CPU B.
>
> It is acceptable because softlockups are detected by the softlockup
> watchdog. The workqueue watchdog is there to detect stalls where
> a work never finishes, for example, because of dependencies of works
> queued into the same workqueue.
>
> V3:
> - Clarify the commit message according to Petr's suggestion.
>
> Signed-off-by: Wang Qing <wangqing@xxxxxxxx>

The patch fixes a real problem:

Reviewed-by: Petr Mladek <pmladek@xxxxxxxx>
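
For readers following the thread, the timestamp handling described in
the commit message can be modeled in plain userspace C. This is only a
rough sketch, not the kernel code: every model_* name, MODEL_NR_CPUS,
MODEL_UNBOUND and the threshold argument are made up for illustration,
and jiffies wraparound is ignored (the kernel would use time_after()
style comparisons).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4      /* hypothetical CPU count for the model */
#define MODEL_UNBOUND (-1)   /* hypothetical marker for an unbound pool */

/* Stand-ins for jiffies and the two timestamps from the commit message. */
static unsigned long model_jiffies;
static unsigned long model_touched;                    /* global */
static unsigned long model_touched_cpu[MODEL_NR_CPUS]; /* per-CPU */

/*
 * Models touch_softlockup_watchdog() on one CPU. With the change it
 * updates the per-CPU timestamp and also the global one.
 */
static void model_touch(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_touched_cpu[cpu] = model_jiffies;
	model_touched = model_jiffies;
}

/*
 * Models touch_all_softlockup_watchdogs(): the per-CPU timestamp has
 * to be updated for all CPUs because bound pools no longer look at
 * the global one.
 */
static void model_touch_all(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_touch(cpu);
}

/* Bound pools read only their per-CPU timestamp, unbound the global. */
static unsigned long model_last_touch(int pool_cpu)
{
	if (pool_cpu == MODEL_UNBOUND)
		return model_touched;
	assert(pool_cpu >= 0 && pool_cpu < MODEL_NR_CPUS);
	return model_touched_cpu[pool_cpu];
}

/*
 * A pool is reported only when neither its own progress timestamp
 * nor the relevant touch timestamp is newer than the threshold.
 */
static bool model_stalled(int pool_cpu, unsigned long pool_ts,
			  unsigned long thresh)
{
	unsigned long touched = model_last_touch(pool_cpu);
	unsigned long ts = pool_ts > touched ? pool_ts : touched;

	return model_jiffies - ts > thresh;
}
```

In this model, a touch on CPU 1 hides a stall of an unbound pool (the
"Beware" case above) while a pool bound to CPU 0 is still reported.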

Best Regards,
Petr