Re: [PATCH 1/4] watchdog/hardlockup: Always update saved interrupts during check

From: Doug Anderson

Date: Thu Mar 05 2026 - 11:18:19 EST


Hi,

On Thu, Mar 5, 2026 at 3:27 AM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> > * watchdog_hardlockup_check() called and saves counter (1000)
> > * timer runs and updates the timer (1000 -> 1001)
> > * touch_nmi_watchdog() is called
> > * CPU locks up
> > * 10 seconds pass
> > * watchdog_hardlockup_check() called and saves counter (1001)
> > * 10 seconds pass
> > * watchdog_hardlockup_check() called and notices touch
>
> Great visualization!
>
> Nit: It seems to be actually the other way around:
>
> * 10 seconds pass
> * watchdog_hardlockup_check() called and notices touch and skips updating counters
> * 10 seconds pass
> * watchdog_hardlockup_check() called and saves counter (1001)

Oops, right! :-) Mayank: it's probably worth adding some form of the
(corrected) example to the commit message. You could also mention in
the commit message that you were seeing real problems because of the
8250 console prints; the general rule is that if someone asks a
question during a review, it's worth including that info in the next
version of the commit message. ;-)


> A better solution might be to separate the check and update/reset
> of the values. Something like (on top of this patchset, just
> compilation tested):
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 30199eaeb5d7..4d0851f0f412 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -167,18 +167,10 @@ void watchdog_hardlockup_touch_cpu(unsigned int cpu)
>  	per_cpu(watchdog_hardlockup_touched, cpu) = true;
>  }
>  
> -static bool is_hardlockup(unsigned int cpu)
> +static void watchdog_hardlockup_update_reset(unsigned int cpu)
>  {
>  	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
>  
> -	if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint) {
> -		per_cpu(hrtimer_interrupts_missed, cpu)++;
> -		if (per_cpu(hrtimer_interrupts_missed, cpu) >= watchdog_hardlockup_miss_thresh)
> -			return true;
> -
> -		return false;
> -	}
> -
>  	/*
>  	 * NOTE: we don't need any fancy atomic_t or READ_ONCE/WRITE_ONCE
>  	 * for hrtimer_interrupts_saved. hrtimer_interrupts_saved is
> @@ -186,8 +178,20 @@ static bool is_hardlockup(unsigned int cpu)
>  	 */
>  	per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
>  	per_cpu(hrtimer_interrupts_missed, cpu) = 0;
> +}
>  
> -	return false;
> +static bool is_hardlockup(unsigned int cpu)
> +{
> +	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
> +
> +	if (per_cpu(hrtimer_interrupts_saved, cpu) != hrint)
> +		return false;
> +
> +	per_cpu(hrtimer_interrupts_missed, cpu)++;
> +	if (per_cpu(hrtimer_interrupts_missed, cpu) < watchdog_hardlockup_miss_thresh)
> +		return false;
> +
> +	return true;
>  }
>  
>  static void watchdog_hardlockup_kick(void)
> @@ -200,23 +204,10 @@ static void watchdog_hardlockup_kick(void)
>  
>  void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>  {
> -	bool is_hl;
>  	int hardlockup_all_cpu_backtrace;
> -	/*
> -	 * Check for a hardlockup by making sure the CPU's timer
> -	 * interrupt is incrementing. The timer interrupt should have
> -	 * fired multiple times before we overflow'd. If it hasn't
> -	 * then this is a good indication the cpu is stuck
> -	 *
> -	 * Purposely check this _before_ checking watchdog_hardlockup_touched
> -	 * so we make sure we still update the saved value of the interrupts.
> -	 * Without that we'll take an extra round through this function before
> -	 * we can detect a lockup.
> -	 */
> -
> -	is_hl = is_hardlockup(cpu);
>  
>  	if (per_cpu(watchdog_hardlockup_touched, cpu)) {
> +		watchdog_hardlockup_update_reset(cpu);
>  		per_cpu(watchdog_hardlockup_touched, cpu) = false;
>  		return;
>  	}
> @@ -224,7 +215,13 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>  	hardlockup_all_cpu_backtrace = (hardlockup_si_mask & SYS_INFO_ALL_BT) ?
>  				       1 : sysctl_hardlockup_all_cpu_backtrace;
>  
> -	if (is_hl) {
> +	/*
> +	 * Check for a hardlockup by making sure the CPU's timer
> +	 * interrupt is incrementing. The timer interrupt should have
> +	 * fired multiple times before we overflow'd. If it hasn't
> +	 * then this is a good indication the cpu is stuck
> +	 */
> +	if (is_hardlockup(cpu)) {
>  		unsigned int this_cpu = smp_processor_id();
>  		unsigned long flags;
>  
> @@ -290,6 +287,7 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
>  
>  		per_cpu(watchdog_hardlockup_warned, cpu) = true;
>  	} else {
> +		watchdog_hardlockup_update_reset(cpu);
>  		per_cpu(watchdog_hardlockup_warned, cpu) = false;
>  	}
>  }

I haven't tested it, but that actually looks like a pretty nice final
result to me. Mayank: What do you think? You'd have to figure out how
to rework your two patches to incorporate Petr's ideas.

Petr: Since you gave your ideas as a diff, what are you thinking in
terms of tags on Mayank's v2? You didn't provide a Signed-off-by on
your diff, so I guess you're expecting Mayank not to incorporate it
directly but take it as a "suggestion" for improving his patches (AKA
not add any of your tags to his v2).

One nit: in the final result, it might be nice to invert the
"is_hardlockup()" test so we can return early and get rid of a level
of indentation. AKA:

	if (!is_hardlockup(cpu)) {
		watchdog_hardlockup_update_reset(cpu);
		per_cpu(watchdog_hardlockup_warned, cpu) = false;
		return;
	}

Not only does it reduce indentation, but it also keeps the two calls
to watchdog_hardlockup_update_reset() closer to each other.

-Doug