Re: [PATCH 1/4] watchdog/hardlockup: Always update saved interrupts during check

From: Doug Anderson

Date: Wed Mar 04 2026 - 19:59:50 EST

Hi,

On Wed, Mar 4, 2026 at 6:44 AM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Thu 2026-02-12 14:12:10, Mayank Rungta via B4 Relay wrote:
> > From: Mayank Rungta <mrungta@xxxxxxxxxx>
> >
> > Currently, arch_touch_nmi_watchdog() causes an early return that
> > skips updating hrtimer_interrupts_saved. This leads to stale
> > comparisons and delayed lockup detection.
> >
> > Update the saved interrupt count before checking the touched flag
> > to ensure detection timeliness.
>
> IMHO, it is not that easy, see below.
>
> So I am curious. Have you found this when debugging a particular
> problem or just by reading the code, please?

As I understand it, Mayank found this because the watchdog was
reacting significantly more slowly than he expected. In his caes, he
tracked it down to the fact that 8250 console driver has several calls
to touch_nmi_watchdog(), including on every call to console_write().
This caused the watchdog to take _much_ longer to fire.

On devices that fairly chatty w/ their output to the serial console,
the console_write() is called almost constantly. That means that the
watchdog is being touched constantly. If I remember Mayank tracked it
down as:

* watchdog_hardlockup_check() called and saves counter (1000)
* timer runs and updates the timer (1000 -> 1001)
* touch_nmi_watchdog() is called
* CPU locks up
* 10 seconds pass
* watchdog_hardlockup_check() called and saves counter (1001)
* 10 seconds pass
* watchdog_hardlockup_check() called and notices touch
* 10 seconds pass
* watchdog_hardlockup_check() called and finally detects lockup

...so we detect the lockup after 30 seconds, which is pretty bad. With
his new scheme, we'd detect the lockup in 20 seconds.

> > @@ -186,7 +186,21 @@ static void watchdog_hardlockup_kick(void)
> >
> > void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > {
> > + bool is_hl;
> > int hardlockup_all_cpu_backtrace;
> > + /*
> > + * Check for a hardlockup by making sure the CPU's timer
> > + * interrupt is incrementing. The timer interrupt should have
> > + * fired multiple times before we overflow'd. If it hasn't
> > + * then this is a good indication the cpu is stuck
> > + *
> > + * Purposely check this _before_ checking watchdog_hardlockup_touched
> > + * so we make sure we still update the saved value of the interrupts.
> > + * Without that we'll take an extra round through this function before
> > + * we can detect a lockup.
> > + */
> > +
> > + is_hl = is_hardlockup(cpu);
> >
> > if (per_cpu(watchdog_hardlockup_touched, cpu)) {
> > per_cpu(watchdog_hardlockup_touched, cpu) = false;
>
> Hmm, this does not look correct to me.
>
> 1. First, let's agree on the meaning of "watchdog_hardlockup_touched".
>
> My understanding is that arch_touch_nmi_watchdog() is called by code
> which might block interrupts (timers) for a long time and wants to
> hide it.
>
> Blocking interrupts for too long is _bad_. In the ideal word,
> nobody should call arch_touch_nmi_watchdog() because we want
> to know about all sinners.
>
> In the real world, we allow to hide some sinners because
> they might produce "false" positives, see touch_nmi_watchdog()
> callers:
>
> + Most callers are printk() related.
>
> We might argue whether it is a false positive or not.
>
> The argument for "touching the watchdog" is that slow serial
> consoles might block IRQs for a long time. But they work as
> expected and can't do better.
>
> Also the stall is kind of expected in this case. We could
> confuse users and/or hide the original problem if the stall
> was reported.
>
> Note that there is a bigger problem with the legacy console
> drivers. printk() tries to emit them immediately. And the
> current console_lock() owner become responsible for flushing
> new messages added by other CPUs in parallel.
>
> It is better with the new NBCON console drivers which are
> offloaded to a kthread. Here, printk() tries to flush them directly
> only when called in an emergency mode (WARN, stall report)
> or in panic().
>
> + There are some other callers, for example, multi_stop_cpu(),
> or hv_do_rep_hypercall_ex(). IMHO, they create stalls on
> purpose.

I guess the problem is that these sinners tend to touch the watchdog
because they _might_ end up blocking too long. ...and they touch the
watchdog constantly. Essentially:

while (something_might_be_slow()) {
touch_nmi_watchdog();
do_something();
}

Even if the operation is _usually_ not slow (like console write), if
the code has a chance of being slow then it will have the touch.

> 2. Let's look at is_hardlockup() in detail:
>
> static bool is_hardlockup(unsigned int cpu)
> {
> int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
>
> if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint) {
> per_cpu(hrtimer_interrupts_missed, cpu)++;
> if (per_cpu(hrtimer_interrupts_missed, cpu) >= watchdog_hardlockup_miss_thresh)
> return true;
>
> return false;
> }
>
> per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
> per_cpu(hrtimer_interrupts_missed, cpu) = 0;
>
> return false;
> }
>
> If we call it when the watchdog was touched then
> (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
>
> => per_cpu(hrtimer_interrupts_missed, cpu)++;
>
> is called even when watchdog was touched.
>
> As a result, we might report stall which should have been hidden,
> for example:
>
> CPU0 CPU1
>
> <NMI>
> watchdog_hardlockup_check() # passes
> is_hardlockup() # no
> hr_int_saved = hr_int;
> hr_int_missed = 0;
> </NMI>
>
> local_irq_save()
> printk()
> console_trylock()
> console_unlock()
> console_flush_all()
>
> touch_nmi_watchdog()
>
> // Other CPUs print many messages,
> // e.g. during boot when initializing a lot of HW
> for (i=0; i<1000; i++) do
> printk();
>
> <NMI>
> watchdog_hardlockup_check()
> is_hardlockup() # yes
> hr_int_missed++ # 1
>
> # skip because touched
> </NMI>
>
> touch_nmi_watchdog()
>
> <NMI>
> watchdog_hardlockup_check()
> is_hardlockup() # yes
> hr_int_missed++ # 2
>
> # skip because touched
> </NMI>
>
> ... repeat many times ...
>
> local_irq_restore()
>
> # this might normally trigger handling of pending IRQs
> # including the timers. But IMHO, it can be offloaded
> # to a kthread (at least on RT)
>
> <NMI>
> watchdog_hardlockup_check()
> is_hardlockup() # yes
> hr_int_missed++ # might be already 3, 4,...
>
> Report hardlockup even when all the "hr_int_missed"
> values should have been ignored because of
> touch_watchdog.
>
> </NMI>
>
>
> A solution might be clearing "hrtimer_interrupts_missed"
> when the watchdog was touched.

Great catch! When I was thinking about Mayank's patches, I thought
about them independently. ...and I believe that independently, each
patch is fine. The problem is that together they have exactly the
problem you indicated.

Clearing "hrtimer_interrupts_missed" seems like the right solution in
Mayank's patch #3.

> But honestly, I am not sure if this is worth the complexity.
>
>
> Higher level look:
> ------------------
>
> My understanding is that this patch has an effect only when
> touch_nmi_watchdog() is called as frequently as
> watchdog_hardlockup_check().
>
> The original code gives the system more time to recover after
> a known stall (touch_nmi_watchdog() called).
>
> The new code is more eager to report a stall. It might be more prone
> to report "false" positives.
>
> IMHO, the root of the problem is that touch_nmi_watchdog() is
> called too frequently. And this patch is rather dancing around
> then fixing it.

I don't think it's really any more likely to report false positives
after the bug you pointed out is fixed. The old watchdog was just too
conservative. With Mayank's proposal I think calling
touch_nmi_watchdog() should reset the watchdog the same amount as
letting the hrtimer run once and that seems like a very reasonable
interpretation.

> Alternative:
> ------------
>
> An alternative solution might to detect and report when too many
> watchdog_hardlockup_check() calls are ignored because of
> touch_nmi_watchdog().
>
> It might help to find a mis-use of touch_nmi_watchdog(). The question
> is what details should be reported in this case.
>
> It should be optional because touch_nmi_watchdog() is supposed
> to hide "well-known" sinners after all.

Hmmmm. I certainly support trying to reduce the number of places that
call touch_nmi_watchdog(), but at the same time I don't think Mayank's
patch is "dancing around" the problem. IMO considering the
touch_nmi_watchdog() to be "pretend a timer interrupt fired" is the
intuitive way one would think the call should work. The fact that the
code gave an entire extra 10 seconds before the watchdog could be
caught just feels like a bug that should be fixed.

For the 8250 driver in particular, it looks like the
touch_nmi_watchdog() was removed from serial8250_console_write() as
part of nbcon, but then that got reverted. That would still leave two
other touch_nmi_watchdog() calls in that driver...

-Doug