Re: [PATCH 3/4] watchdog/hardlockup: improve buddy system detection timeliness

From: Petr Mladek

Date: Thu Mar 05 2026 - 08:49:13 EST


On Thu 2026-02-12 14:12:12, Mayank Rungta via B4 Relay wrote:
> From: Mayank Rungta <mrungta@xxxxxxxxxx>
>
> Currently, the buddy system only performs checks every 3rd sample. With
> a 4-second interval. If a check window is missed, the next check occurs
> 12 seconds later, potentially delaying hard lockup detection for up to
> 24 seconds.
>
> Modify the buddy system to perform checks at every interval (4s).
> Introduce a missed-interrupt threshold to maintain the existing grace
> period while reducing the detection window to 8-12 seconds.
>
> Best and worst case detection scenarios:
>
> Before (12s check window):
> - Best case: Lockup occurs after first check but just before heartbeat
> interval. Detected in ~8s (8s till next check).
> - Worst case: Lockup occurs just after a check.
> Detected in ~24s (missed check + 12s till next check + 12s logic).
>
> After (4s check window with threshold of 3):
> - Best case: Lockup occurs just before a check.
> Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd).
> - Worst case: Lockup occurs just after a check.
> Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd).

One might argue that the interval <8s,24s> is not much worse than
<6s,20s> achieved by the perf detector.

But I personally like that the spread of <8s,12s> is narrower, so
the result is more predictable. And it is relatively cheap.
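
For reference, the <8s,12s> window can be reproduced with a small userspace sketch. It only models the simplified arithmetic from the commit message (time until the next check, plus two further 4s checks), not the real interaction between the two CPUs' timers. The names CHECK_PERIOD_MS, NR_MISSES_TO_REPORT and detect_latency_ms() are made up for illustration and do not exist in the kernel.

```c
/* Assumed values taken from the commit message: 4s checks, threshold of 3. */
#define CHECK_PERIOD_MS     4000u
#define NR_MISSES_TO_REPORT 3u

/*
 * Hypothetical helper: given how long ago the last buddy check ran when
 * the CPU locked up, return the time until the lockup is reported.
 * The first check after the lockup counts as miss 1, and each further
 * check adds one more miss until the threshold is reached.
 */
static unsigned int detect_latency_ms(unsigned int since_last_check_ms)
{
	unsigned int to_first_check =
		CHECK_PERIOD_MS - since_last_check_ms % CHECK_PERIOD_MS;

	return to_first_check + (NR_MISSES_TO_REPORT - 1) * CHECK_PERIOD_MS;
}
```

Under this model, a lockup right after a check (offset 0) takes 12000 ms to be reported and a lockup 1 ms before the next check (offset 3999) takes 8001 ms, which matches the worst and best cases quoted above.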

People might have a different opinion. But I am fine with this change.

> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -163,8 +171,13 @@ static bool is_hardlockup(unsigned int cpu)
> {
> int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
>
> - if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> - return true;
> + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint) {
> + per_cpu(hrtimer_interrupts_missed, cpu)++;
> + if (per_cpu(hrtimer_interrupts_missed, cpu) >= watchdog_hardlockup_miss_thresh)

This would return true for every check when missed >= 3.
As a result, the hardlockup would be reported every 4s.

I would keep the 12s cadence and change this to:

if (per_cpu(hrtimer_interrupts_missed, cpu) % watchdog_hardlockup_miss_thresh == 0)
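
To illustrate the difference in reporting cadence, here is a minimal userspace sketch, not kernel code. It assumes that the missed counter is incremented on every check of a stuck CPU before the comparison (as in the hunk above) and that it is reset elsewhere once the interrupt count advances, which is not visible in the quoted hunk. MISS_THRESH, report_ge(), report_mod() and count_reports() are hypothetical names used only for this example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for watchdog_hardlockup_miss_thresh. */
#define MISS_THRESH 3

/* Patch as posted: true on every check once the threshold is reached. */
static bool report_ge(unsigned int missed)
{
	return missed >= MISS_THRESH;
}

/* Suggested variant: true only on each multiple of the threshold. */
static bool report_mod(unsigned int missed)
{
	return missed % MISS_THRESH == 0;
}

/*
 * Count how many times a lockup would be reported over nr_checks
 * consecutive checks of a CPU whose interrupt counter never advances.
 */
static unsigned int count_reports(bool (*should_report)(unsigned int),
				  unsigned int nr_checks)
{
	unsigned int missed = 0, reports = 0;

	for (unsigned int i = 0; i < nr_checks; i++) {
		missed++;	/* bumped before the comparison, as in the patch */
		if (should_report(missed))
			reports++;
	}
	return reports;
}
```

Over 9 checks (36s at a 4s interval), count_reports(report_ge, 9) gives 7 reports while count_reports(report_mod, 9) gives 3, that is, one report every 12s. Because the counter is incremented before the comparison, it is never 0 at the check, so the modulo variant does not fire before the threshold is reached.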

> + return true;
> +
> + return false;
> + }
>
> /*
> * NOTE: we don't need any fancy atomic_t or READ_ONCE/WRITE_ONCE
> --- a/kernel/watchdog_buddy.c
> +++ b/kernel/watchdog_buddy.c
> @@ -86,14 +87,6 @@ void watchdog_buddy_check_hardlockup(int hrtimer_interrupts)
> {
> unsigned int next_cpu;
>
> - /*
> - * Test for hardlockups every 3 samples. The sample period is
> - * watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
> - * watchdog_thresh (over by 20%).
> - */
> - if (hrtimer_interrupts % 3 != 0)
> - return;

It would be symmetric with the "% 3" above.

> -
> /* check for a hardlockup on the next CPU */
> next_cpu = watchdog_next_cpu(smp_processor_id());
> if (next_cpu >= nr_cpu_ids)

Best Regards,
Petr