Re: [patch 5/5] clocksource: Rewrite watchdog code completely
From: Thomas Gleixner
Date: Sun Mar 08 2026 - 06:05:44 EST
On Wed, Feb 25 2026 at 19:13, Jiri Wiesner wrote:
> On Sat, Jan 24, 2026 at 12:18:01AM +0100, Thomas Gleixner wrote:
>> To address this and bring back sanity to the watchdog, rewrite the code
>> completely with a different approach:
>>
>> 1) Restrict the validation against a reference clocksource to the boot
>> CPU, which is usually the CPU/Socket closest to the legacy block which
>> contains the reference source (HPET/ACPI-PM timer).
>
> The UEFI picks the boot CPU so the kernel does not have control over
> that. On the other hand, I think the CPU that is connected to the
> southbridge chip (by DMI or PCIe) will be selected in the majority of
> UEFI implementations.

Picking a remote node CPU would be insane, but yes, BIOSes are insane by
definition.

> There is one issue: What if the reference clocksource itself
> experiences time skew? I have seen a case like this with the sgi_rtc
> clocksource. I created a debugging kernel with the HPET as a second
> watchdog (not affecting the decisions by the watchdog) and got this
> result:
>> clocksource: timekeeping watchdog on CPU118: Marking clocksource 'tsc' as unstable because the skew is too large:
>> clocksource: 'sgi_rtc' wd_nsec: 511302794 wd_now: 1cb50e4c4b wd_last: 1ca7097111 mask: ffffffffffffff
>> clocksource: 'hpet' wd2_nsec: 512005960 wd2_now: 65892719 wd2_last: 64c5d684 mask: ffffffff
>> clocksource: 'tsc' cs_nsec: 512006458 cs_now: 86b5982cb1 cs_last: 867581bbab mask: ffffffffffffffff
>> clocksource: 'tsc' skewed 703664 ns (0 ms) over watchdog 'sgi_rtc' interval of 511302794 ns (511 ms)
>> clocksource: 'tsc' is current clocksource.
>> tsc: Marking TSC unstable due to clocksource watchdog
>> clocksource: Checking clocksource tsc synchronization from CPU 610 to CPUs 0-609,611-767.
>> clocksource: Switched to clocksource sgi_rtc
>
> The intervals measured by the TSC and the HPET match very well; the
> sgi_rtc is off. Even the new implementation of the clocksource
> watchdog would be susceptible to the reference clocksource
> experiencing time skew. I think the clocksource watchdog needs to make
> the assumption that the reference clocksource is right, and the onus
> should be on hardware developers to make sure the reference
> clocksource is accurate. In reality, one has to resort to disabling
> the reference clocksource experiencing time skew or, at least,
> decreasing the rating of that clocksource.

Yes, we have to make the assumption that the watchdog clocksource is
actually stable and accurate. If the sgi_rtc is unreliable, then it
should be rated down. AFAICT it is per blade and I have no idea how
well it is synchronized across blades.
>> +static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
>> +{
>> + /*
>> + * Calculate and validate the skew against the allowed PPM
>> + * value of the maximum delta plus the watchdog readout
>> + * time.
>> + */
>> + if (abs(wd_delta - cs_delta) < (max_delta >> ppm_shift) + wd_seq)
>> + return true;
>
> Making the threshold proportional to the length of the interval
> resolves the issue with the (previously) fixed threshold and the
> interval being stretched on account of the timer running later than
> when it was meant to expire.

Indeed.
>> +static void watchdog_check_result(struct clocksource *cs)
>> {
>> - struct clocksource *cs;
>> + switch (watchdog_data.result) {
>> + case WD_SUCCESS:
>> + clocksource_tick_stable(cs);
>> + clocksource_enable_highres(cs);
>> + return;
>>
>> - list_for_each_entry(cs, &watchdog_list, wd_list)
>> + case WD_FREQ_TIMEOUT:
>> + watchdog_print_freq_timeout(cs);
>> + /* Try again later and invalidate the reference timestamps. */
>> cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
>> -}
>> + return;
> I like that the new clocksource watchdog is far less punishing. A
> clocksource may be marked unstable only when the readout latency is
> below 50 us (and there is time skew or unsynchronized CPU
> sockets). There is no need for skipping watchdog checks to mitigate
> the clocksource being marked unstable on account of quite possibly
> unrelated readout latency, SMIs or vCPU preemption.

That was the design goal of the rewrite. Glad you like it.

Thanks,

        tglx