Re: Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking

From: Petr Mladek
Date: Tue Mar 23 2021 - 11:07:14 EST


On Tue 2021-03-23 20:37:35, 王擎 wrote:
>
> >On Fri 2021-03-19 16:00:36, Wang Qing wrote:
> >> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu
> >> updated, while the unbound worker_pool running on its core uses
> >> wq_watchdog_touched to determine whether locked up. This may be mischecked.
> >
> >By other words, unbound workqueues are not aware of the more common
> >touch_softlockup_watchdog() because it updates only
> >wq_watchdog_touched_cpu for the affected CPU. As a result,
> >the workqueue watchdog might report lockup in unbound workqueue
> >even though it is blocked by a known slow code.
>
> Yes, this is the problem I'm talking about.

I thought more about it. This patch prevents a false positive.
Could it bring an opposite problem and hide real problems?

I mean that an unbound workqueue might get blocked on CPU A
because of a real softlockup. But we might not notice it because
CPU B is touched. Well, there are other ways how to detect
this situation, e.g. the softlockup watchdog.


> >> My suggestion is to update both when touch_softlockup_watchdog() is called,
> >> use wq_watchdog_touched_cpu to check bound, and use wq_watchdog_touched
> >> to check unbound worker_pool.
> >>
> >> Signed-off-by: Wang Qing <wangqing@xxxxxxxx>
> >> ---
> >> kernel/watchdog.c | 5 +++--
> >> kernel/workqueue.c | 17 ++++++-----------
> >> 2 files changed, 9 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> >> index 7110906..107bc38
> >> --- a/kernel/watchdog.c
> >> +++ b/kernel/watchdog.c
> >> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
> >> * update as well, the only side effect might be a cycle delay for
> >> * the softlockup check.
> >> */
> >> - for_each_cpu(cpu, &watchdog_allowed_mask)
> >> + for_each_cpu(cpu, &watchdog_allowed_mask) {
> >> per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
> >> - wq_watchdog_touch(-1);
> >> + wq_watchdog_touch(cpu);
> >
> >Note that wq_watchdog_touch(cpu) newly always updates
> >wq_watchdog_touched. This cycle will set the same jiffies
> >value cpu-times to the same variable.
> >
> Although there is a bit of redundancy here, but the most concise way of
> implementation, and it is certain that it will not affect performance.
>
> >> + }
> >> }
> >>
> >> void touch_softlockup_watchdog_sync(void)
> >> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> >> index 0d150da..be08295
> >> --- a/kernel/workqueue.c
> >> +++ b/kernel/workqueue.c
> >> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
> >> {
> >> if (cpu >= 0)
> >> per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
> >> - else
> >> - wq_watchdog_touched = jiffies;
> >> +
> >> + wq_watchdog_touched = jiffies;
> >> }
> >>
> >> static void wq_watchdog_set_thresh(unsigned long thresh)
> >
> >This last hunk is enough to fix the problem. wq_watchdog_touched will
> >get updated also from cpu-specific touch_softlockup_watchdog().
>
> Not enough in fact, because wq_watchdog_touched was updated in wq_watchdog_touch(),
> so wq_watchdog_touched can no longer be used to judge the bound pool, we must update
> every wq_watchdog_touched_cpu in touch_all_softlockup_watchdogs() for bound judgment.

I see. Your patch makes sense.

I would just improve the commit message. It should better explain
why all the twists are needed. I would write something like:

<proposal>
Subject: workqueue/watchdog: Make unbound workqueues aware of
touch_softlockup_watchdog()

There are two workqueue-specific watchdog timestamps:

+ @wq_watchdog_touched_cpu (per-CPU) updated by
touch_softlockup_watchdog()

+ @wq_watchdog_touched (global) updated by
touch_all_softlockup_watchdogs()

watchdog_timer_fn() checks only the global @wq_watchdog_touched for
unbound workqueues. As a result, unbound workqueues are not aware
of touch_softlockup_watchdog(). The watchdog might report a stall
even when the unbound workqueues are blocked by a known slow code.

Solution:

touch_softlockup_watchdog() must touch also the global
@wq_watchdog_touched timestamp.

The global timestamp can not longer be used for bound workqueues
because it is updated on all CPUs. Instead, bound workqueues
have to check only @wq_watchdog_touched_cpu and these timestamp
has to be updated for all CPUs in touch_all_softlockup_watchdogs().

Beware:

The change might cause the opposite problem. An unbound workqueue
might get blocked on CPU A because of a real softlockup. The workqueue
watchdog would miss it when the timestamp got touched on CPU B.

It is acceptable because softlockups are detected by softlockup
watchdog. The workqueue watchdog is there to detect stalls where
a work never finishes, for example, because of dependencies of works
queued into the same workqueue.
</proposal>

How does that sound?

Best Regards,
Petr