Re: [PATCH 1/4] watchdog/hardlockup: Always update saved interrupts during check

From: Mayank Rungta

Date: Tue Mar 10 2026 - 22:52:00 EST


Thanks, everyone, for ironing out the details. I have tried to
implement the ideas discussed in this thread; see below.

> On Mon, Mar 9, 2026 at 6:33 AM Petr Mladek <pmladek@xxxxxxxx> wrote:
>
> On Thu 2026-03-05 08:13:39, Doug Anderson wrote:
> > Hi,
> >
> > On Thu, Mar 5, 2026 at 3:27 AM Petr Mladek <pmladek@xxxxxxxx> wrote:
> > >
> > > > * watchdog_hardlockup_check() called and saves counter (1000)
> > > > * timer runs and updates the timer (1000 -> 1001)
> > > > * touch_nmi_watchdog() is called
> > > > * CPU locks up
> > > > * 10 seconds pass
> > > > * watchdog_hardlockup_check() called and saves counter (1001)
> > > > * 10 seconds pass
> > > > * watchdog_hardlockup_check() called and notices touch
> > >
> > > Great visualization!
> > >
> > > Nit: It seems to be actually the other way around:
> > >
> > > * 10 seconds pass
> > > * watchdog_hardlockup_check() called and notices touch and skips updating counters
> > > * 10 seconds pass
> > > * watchdog_hardlockup_check() called and saves counter (1001)
> >
> > Oops, right! :-) Mayank: it's probably worth adding some form of the
> > (corrected) example here to the commit message. Also, you could
> > mention in the commit message that you were seeing real problems
> > because of the 8250 console prints with the general rule that if
> > someone asks a question during a review it's worth including that
> > info in the next version of the commit message. ;-)
> >

That's a good idea; I will add this to the commit message for v2.

> >
> > > A better solution might be to separate the check and update/reset
> > > of the values. Something like (on top of this patchset, just
> > > compilation tested):
> > >
> > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > > index 30199eaeb5d7..4d0851f0f412 100644
> > > --- a/kernel/watchdog.c
> > > +++ b/kernel/watchdog.c
> > > @@ -167,18 +167,10 @@ void watchdog_hardlockup_touch_cpu(unsigned int cpu)
> > >  	per_cpu(watchdog_hardlockup_touched, cpu) = true;
> > >  }
> > >
> > > -static bool is_hardlockup(unsigned int cpu)
> > > +static void watchdog_hardlockup_update_reset(unsigned int cpu)
> > >  {
> > >  	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
> > >
> > > -	if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint) {
> > > -		per_cpu(hrtimer_interrupts_missed, cpu)++;
> > > -		if (per_cpu(hrtimer_interrupts_missed, cpu) >= watchdog_hardlockup_miss_thresh)
> > > -			return true;
> > > -
> > > -		return false;
> > > -	}
> > > -
> > >  	/*
> > >  	 * NOTE: we don't need any fancy atomic_t or READ_ONCE/WRITE_ONCE
> > >  	 * for hrtimer_interrupts_saved. hrtimer_interrupts_saved is
> > > @@ -186,8 +178,20 @@ static bool is_hardlockup(unsigned int cpu)
> > >  	 */
> > >  	per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
> > >  	per_cpu(hrtimer_interrupts_missed, cpu) = 0;
> > > +}
> > >
> > > -	return false;
> > > +static bool is_hardlockup(unsigned int cpu)
> > > +{
> > > +	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
> > > +
> > > +	if (per_cpu(hrtimer_interrupts_saved, cpu) != hrint)
> > > +		return false;
> > > +
> > > +	per_cpu(hrtimer_interrupts_missed, cpu)++;
> > > +	if (per_cpu(hrtimer_interrupts_missed, cpu) < watchdog_hardlockup_miss_thresh)
> > > +		return false;
> > > +
> > > +	return true;
> > >  }
> > >
> > >  static void watchdog_hardlockup_kick(void)
> > > @@ -200,23 +204,10 @@ static void watchdog_hardlockup_kick(void)
> > >
> > >  void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > >  {
> > > -	bool is_hl;
> > >  	int hardlockup_all_cpu_backtrace;
> > > -	/*
> > > -	 * Check for a hardlockup by making sure the CPU's timer
> > > -	 * interrupt is incrementing. The timer interrupt should have
> > > -	 * fired multiple times before we overflow'd. If it hasn't
> > > -	 * then this is a good indication the cpu is stuck
> > > -	 *
> > > -	 * Purposely check this _before_ checking watchdog_hardlockup_touched
> > > -	 * so we make sure we still update the saved value of the interrupts.
> > > -	 * Without that we'll take an extra round through this function before
> > > -	 * we can detect a lockup.
> > > -	 */
> > > -
> > > -	is_hl = is_hardlockup(cpu);
> > >
> > >  	if (per_cpu(watchdog_hardlockup_touched, cpu)) {
> > > +		watchdog_hardlockup_update_reset(cpu);
> > >  		per_cpu(watchdog_hardlockup_touched, cpu) = false;
> > >  		return;
> > >  	}
> > > @@ -224,7 +215,13 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > >  	hardlockup_all_cpu_backtrace = (hardlockup_si_mask & SYS_INFO_ALL_BT) ?
> > >  			1 : sysctl_hardlockup_all_cpu_backtrace;
> > >
> > > -	if (is_hl) {
> > > +	/*
> > > +	 * Check for a hardlockup by making sure the CPU's timer
> > > +	 * interrupt is incrementing. The timer interrupt should have
> > > +	 * fired multiple times before we overflow'd. If it hasn't
> > > +	 * then this is a good indication the cpu is stuck
> > > +	 */
> > > +	if (is_hardlockup(cpu)) {
> > >  		unsigned int this_cpu = smp_processor_id();
> > >  		unsigned long flags;
> > >
> > > @@ -290,6 +287,7 @@ void watchdog_hardlockup_check(unsigned int cpu, struct pt_regs *regs)
> > >
> > >  		per_cpu(watchdog_hardlockup_warned, cpu) = true;
> > >  	} else {
> > > +		watchdog_hardlockup_update_reset(cpu);
> > >  		per_cpu(watchdog_hardlockup_warned, cpu) = false;
> > >  	}
> > >  }
> >
> > I haven't tested it, but that actually looks like a pretty nice final
> > result to me. Mayank: What do you think? You'd have to figure out how
> > to rework your two patches to incorporate Petr's ideas.
> >

Thanks for your suggestion, this is pretty close, but we cannot call
watchdog_hardlockup_update_reset(cpu) from a general else branch. If
we did, hrtimer_interrupts_missed would be reset on every check that
does not report a hardlockup, even when the CPU made no progress, so
the count could never reach watchdog_hardlockup_miss_thresh.

I’ve moved the update/reset inside the "progress detected" path within
is_hardlockup() instead:

static bool is_hardlockup(unsigned int cpu)
{
        int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));

        if (per_cpu(hrtimer_interrupts_saved, cpu) != hrint) {
                watchdog_hardlockup_update_reset(cpu);
                return false;
        }

        per_cpu(hrtimer_interrupts_missed, cpu)++;
        if (per_cpu(hrtimer_interrupts_missed, cpu) <
            watchdog_hardlockup_miss_thresh)
                return false;

        return true;
}

> > Petr: Since you gave your ideas as a diff, what are you thinking in
> > terms of tags on Mayank's v2? You didn't provide a Signed-off-by on
> > your diff, so I guess you're expecting Mayank not to incorporate it
> > directly but take it as a "suggestion" for improving his patches (AKA
> > not add any of your tags to his v2).
>
> I expected that Mayank could rework his patchset using ideas from the
> diff. Feel free to use the changes as they are and copy&paste them
> from my diff. It is just a refactoring.
>
> > One nit: in the final result, it might be nice to invert the
> > "is_hardlockup()" test so we can return early and get rid of a level
> > of indentation. AKA:
> >
> > if (!is_hardlockup(cpu)) {
> > 	watchdog_hardlockup_update_reset(cpu);
> > 	per_cpu(watchdog_hardlockup_warned, cpu) = false;
> > 	return;
> > }
> >
> > Not only does it reduce indentation, but it also keeps the two calls
> > to watchdog_hardlockup_update_reset() closer to each other.
>
> Yeah, that would be great. I actually wanted to do it in my diff
> as well. But I did not do it to keep the diff simple.
>
> It might be better to invert the logic as a separate preparation
> patch so that we do not hide other changes in the reshuffling.
>
> Best Regards,
> Petr

I still went ahead and inverted the is_hardlockup() test in the main
check function to keep the indentation clean.