Re: [patch 5/5] clocksource: Rewrite watchdog code completely

From: Jiri Wiesner

Date: Wed Feb 25 2026 - 13:14:14 EST

On Sat, Jan 24, 2026 at 12:18:01AM +0100, Thomas Gleixner wrote:
> To address this and bring back sanity to the watchdog, rewrite the code
> completely with a different approach:
>
> 1) Restrict the validation against a reference clocksource to the boot
> CPU, which is usually the CPU/Socket closest to the legacy block which
> contains the reference source (HPET/ACPI-PM timer).

The UEFI firmware picks the boot CPU, so the kernel has no control over that. On the other hand, I think the CPU that is connected to the southbridge chip (by DMI or PCIe) will be selected in the majority of UEFI implementations. Even if the boot CPU had to use the inter-processor link, the readout latency should often stay below the 50 microsecond threshold. This is a histogram of the hpet-tsc-hpet readout latency (in nanoseconds) as measured by the old clocksource watchdog (reads carried out from all CPUs on the machine):

wd_delay Duration Distribution
wd_delay Duration Average: 7822 +- 9413 (min 2875, max 77916)
Range (ns)        Count
    0 -  5000      2766 (73.10%)
 5000 - 10000       383 (10.12%)
20000 - 25000       402 (10.62%)
25000 - 30000        94 ( 2.48%)
30000 - 35000        49 ( 1.29%)
35000 - 40000        35 ( 0.92%)
40000 - 45000        21 ( 0.55%)
45000 - 50000        14 ( 0.37%)
50000 - 55000         7 ( 0.18%)
55000 - 60000         3 ( 0.08%)
60000 - 65000         4 ( 0.11%)
65000 - 70000         2 ( 0.05%)
70000 - 75000         1 ( 0.03%)
75000 - 80000         3 ( 0.08%)
Total count: 3,784

The machine has 8 NUMA nodes and 960 CPUs (Intel Xeon Platinum 8490H). The machine was running:
stress-ng -t 30m --cpu 480 --switch 520
This certainly does not represent the effect of an arbitrary workload on the inter-processor link, but it is a data point.
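
To put the histogram in terms of the 50 microsecond threshold, the bin counts can be summed up to a limit. The following is only a sketch in userspace C; the struct and function names are made up for illustration and the bin data is copied from the histogram above.

```c
#include <stddef.h>

/* One histogram bin: upper bound in ns and number of samples in it. */
struct wd_bin {
	unsigned int upper_ns;
	unsigned int count;
};

/* Bins copied from the histogram above; only the bins listed there appear. */
static const struct wd_bin wd_bins[] = {
	{  5000, 2766 }, { 10000, 383 }, { 25000, 402 }, { 30000, 94 },
	{ 35000,   49 }, { 40000,  35 }, { 45000,  21 }, { 50000, 14 },
	{ 55000,    7 }, { 60000,   3 }, { 65000,   4 }, { 70000,  2 },
	{ 75000,    1 }, { 80000,   3 },
};

/* Sum the samples of all bins whose upper bound does not exceed limit_ns. */
static unsigned int wd_samples_below(unsigned int limit_ns)
{
	unsigned int sum = 0;
	size_t i;

	for (i = 0; i < sizeof(wd_bins) / sizeof(wd_bins[0]); i++) {
		if (wd_bins[i].upper_ns <= limit_ns)
			sum += wd_bins[i].count;
	}
	return sum;
}
```

With these numbers, wd_samples_below(50000) yields 3764 of 3784 samples, so 20 readouts (about 0.5%) exceeded 50 microseconds under this particular workload.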

There is one issue: What if the reference clocksource itself experiences time skew? I have seen a case like this with the sgi_rtc clocksource. I created a debugging kernel with the HPET as a second watchdog (not affecting the decisions by the watchdog) and got this result:
> clocksource: timekeeping watchdog on CPU118: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource: 'sgi_rtc' wd_nsec: 511302794 wd_now: 1cb50e4c4b wd_last: 1ca7097111 mask: ffffffffffffff
> clocksource: 'hpet' wd2_nsec: 512005960 wd2_now: 65892719 wd2_last: 64c5d684 mask: ffffffff
> clocksource: 'tsc' cs_nsec: 512006458 cs_now: 86b5982cb1 cs_last: 867581bbab mask: ffffffffffffffff
> clocksource: 'tsc' skewed 703664 ns (0 ms) over watchdog 'sgi_rtc' interval of 511302794 ns (511 ms)
> clocksource: 'tsc' is current clocksource.
> tsc: Marking TSC unstable due to clocksource watchdog
> clocksource: Checking clocksource tsc synchronization from CPU 610 to CPUs 0-609,611-767.
> clocksource: Switched to clocksource sgi_rtc

The intervals measured by the TSC and the HPET match very well; the sgi_rtc is off. Even the new implementation of the clocksource watchdog would be susceptible to the reference clocksource experiencing time skew. I think the clocksource watchdog needs to make the assumption that the reference clocksource is right, and the onus should be on hardware developers to make sure the reference clocksource is accurate. In reality, one has to resort to disabling the reference clocksource experiencing time skew or, at least, decreasing the rating of that clocksource.
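
The arithmetic behind that reading of the log can be redone by hand. The sketch below is plain userspace C, not kernel code; the macro and function names are hypothetical and the three interval lengths are the wd_nsec, wd2_nsec and cs_nsec values from the log above.

```c
#include <stdint.h>

/* Interval lengths in ns, copied from the log above. */
#define SGI_RTC_NSEC 511302794LL
#define HPET_NSEC    512005960LL
#define TSC_NSEC     512006458LL

/* Signed difference between two measurements of the same interval. */
static int64_t interval_skew_ns(int64_t cs_nsec, int64_t ref_nsec)
{
	return cs_nsec - ref_nsec;
}

/* The same difference in parts per million of the reference interval. */
static int64_t interval_skew_ppm(int64_t cs_nsec, int64_t ref_nsec)
{
	return (cs_nsec - ref_nsec) * 1000000 / ref_nsec;
}
```

interval_skew_ns(TSC_NSEC, SGI_RTC_NSEC) gives the 703664 ns reported by the watchdog (roughly 1376 ppm), whereas interval_skew_ns(TSC_NSEC, HPET_NSEC) is only 498 ns, which is less than 1 ppm.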

> +static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
> +{
> + /*
> + * Calculate and validate the skew against the allowed PPM
> + * value of the maximum delta plus the watchdog readout
> + * time.
> + */
> + if (abs(wd_delta - cs_delta) < (max_delta >> ppm_shift) + wd_seq)
> + return true;

Making the threshold proportional to the length of the interval resolves the issue with the previously fixed threshold, which could be exceeded merely because the interval was stretched when the timer ran later than it was meant to expire.
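
A userspace sketch of the quoted comparison may help to see the effect. It is written under assumptions: the shift value of 11 (1/2048, roughly 488 ppm) is picked for illustration only, max_delta is assumed to be the larger of the two deltas, all values are assumed to be in nanoseconds, and the names with the sketch prefix are hypothetical rather than taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: illustrative shift only, 1/2048 is roughly 488 ppm. */
#define SKETCH_PPM_SHIFT 11

/*
 * Mirrors the quoted check: the allowed skew grows with the length of
 * the interval and is widened by the watchdog readout time wd_seq.
 */
static bool sketch_check_freq(uint64_t wd_delta, uint64_t cs_delta,
			      uint64_t wd_seq)
{
	/* ASSUMPTION: max_delta is the larger of the two deltas. */
	uint64_t max_delta = wd_delta > cs_delta ? wd_delta : cs_delta;
	uint64_t skew = wd_delta > cs_delta ? wd_delta - cs_delta
					    : cs_delta - wd_delta;

	return skew < (max_delta >> SKETCH_PPM_SHIFT) + wd_seq;
}
```

Under these assumptions, a clocksource that is 100 ppm off accumulates 500000 ns of skew over a 5 second interval, which would trip any fixed threshold sized for a 0.5 second interval, yet sketch_check_freq(5000000000, 5000500000, 0) still passes because the allowed skew scales to about 2.4 ms.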

> +static void watchdog_check_result(struct clocksource *cs)
> {
> - struct clocksource *cs;
> + switch (watchdog_data.result) {
> + case WD_SUCCESS:
> + clocksource_tick_stable(cs);
> + clocksource_enable_highres(cs);
> + return;
>
> - list_for_each_entry(cs, &watchdog_list, wd_list)
> + case WD_FREQ_TIMEOUT:
> + watchdog_print_freq_timeout(cs);
> + /* Try again later and invalidate the reference timestamps. */
> cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> -}
> + return;

I like that the new clocksource watchdog is far less punishing. A clocksource may be marked unstable only when the readout latency is below 50 us (and there is time skew or unsynchronized CPU sockets). There is no need for skipping watchdog checks to mitigate the clocksource being marked unstable on account of quite possibly unrelated readout latency, SMIs or vCPU preemption.
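
The behaviour described above can be condensed into a small decision function. This is a sketch of my reading of it, not the code from the patch: the enum, the function name and the way the 50 us limit is applied are all hypothetical, and the real implementation distinguishes more result codes (such as the WD_FREQ_TIMEOUT case quoted above).

```c
#include <stdbool.h>
#include <stdint.h>

/* The 50 us readout limit discussed in this thread, in nanoseconds. */
#define SKETCH_MAX_READOUT_NS 50000ULL

/* Hypothetical verdicts, named for this sketch only. */
enum sketch_verdict {
	SKETCH_OK,		/* readout fast, skew within bounds */
	SKETCH_RETRY,		/* readout too slow, draw no conclusion */
	SKETCH_UNSTABLE,	/* readout fast, yet skew out of bounds */
};

static enum sketch_verdict sketch_judge(uint64_t readout_ns, bool skew_ok)
{
	/* A slow readout says nothing about the clocksource itself. */
	if (readout_ns >= SKETCH_MAX_READOUT_NS)
		return SKETCH_RETRY;

	return skew_ok ? SKETCH_OK : SKETCH_UNSTABLE;
}
```

The point of the ordering is that a slow readout (SMI, vCPU preemption, a congested inter-processor link) only postpones the verdict; the clocksource can be condemned only on the strength of a sample that was read quickly.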

--
Jiri Wiesner
SUSE Labs