Re: [patch 5/5] clocksource: Rewrite watchdog code completely

From: Thomas Gleixner

Date: Mon Feb 02 2026 - 06:28:13 EST


On Mon, Feb 02 2026 at 14:45, Daniel J. Blueman wrote:
> Great work Thomas!

Thank you!

> On Sat, 24 Jan 2026 at 07:18, Thomas Gleixner <tglx@xxxxxxxxxx> wrote:
>> 2) Compare the TSCs of the other CPUs in a round robin fashion against
>> the boot CPU in the same way the TSC synchronization on CPU hotplug
>> works. This still can suffer from delayed reaction of the remote CPU
>> to the SMP function call and the latency of the control variable cache
>> line. But this latency is not affecting correctness. It only affects
>> the accuracy. With low contention the readout latency is in the low
>> nanoseconds range, which detects even slight skews between CPUs. Under
>> high contention this becomes obviously less accurate, but still
>> detects slow skews reliably as it solely relies on subsequent readouts
>> being monotonically increasing. It just can take slightly longer to
>> detect the issue.
>
> On x86, I agree iterating at a per-thread level is needed rather than
> one thread per NUMA node, since the TSC_ADJUST architectural MSR is
> per-core and we want detection completeness.

It's per thread, not per core.

But that aside, the TSC_ADJUST integrity is already self-monitored
independently of the watchdog. See tsc_verify_tsc_adjust(). So we might
get away with a per-socket check, as all threads of a socket are fed by
the same ART (Always Running Timer) and the main concern is that the
ARTs of different sockets drift apart, especially on systems with more
than four sockets.

> On other architectures, completeness could be traded off for lower
> overhead if it is guaranteed each processor thread uses the same clock
> value, though this is actually what the clocksource watchdog seeks
> to validate, so agreed on the current approach there too.

x86 is the only architecture which actually utilizes the watchdog.

>> +/* Maximum time between two watchdog readouts */
>> +#define WATCHDOG_READOUT_MAX_NS (50 * NSEC_PER_USEC)

> At 1920 threads, the default timeout threshold of 20us triggers
> continuous warnings at idle, however 1000us causes none under an 8
> hour adverse workload [1]; no HPET fallback was seen. A 500us
> threshold causes a low rate of timeouts [2] (overhead amplified due to
> retries), thus 1000us adds margin and should prevent retries.

Right. Idle is definitely an issue when the remote CPU is in a deep
C-state.

My concern is that the control CPU might spin there for a millisecond
with interrupts disabled, which is not really desirable, especially on
RT systems.

Something like the untested delta patch below should work.

Thanks,

tglx
---
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -7,6 +7,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/clocksource.h>
 #include <linux/init.h>
@@ -124,7 +125,8 @@ static atomic_t watchdog_reset_pending;
 #define WATCHDOG_INTERVAL_NS (WATCHDOG_INTERVAL * (NSEC_PER_SEC / HZ))
 
 /* Maximum time between two watchdog readouts */
-#define WATCHDOG_READOUT_MAX_NS (50 * NSEC_PER_USEC)
+#define WATCHDOG_READOUT_MAX_US 50
+#define WATCHDOG_READOUT_MAX_NS (WATCHDOG_READOUT_MAX_US * NSEC_PER_USEC)
 
 /* Shift values to calculate the approximate $N ppm of a given delta. */
 #define SHIFT_500PPM 11
@@ -136,6 +138,9 @@ static atomic_t watchdog_reset_pending;
 /* Five reads local and remote for inter CPU skew detection */
 #define WATCHDOG_REMOTE_MAX_SEQ 10
 
+/* Number of attempts to synchronize with a remote CPU */
+#define WATCHDOG_REMOTE_RETRIES 10
+
 static inline void clocksource_watchdog_lock(unsigned long *flags)
 {
 	spin_lock_irqsave(&watchdog_lock, *flags);
@@ -336,22 +341,17 @@ static void watchdog_check_skew_remote(v
 	atomic_dec(&wd->remote_inprogress);
 }
 
-static void watchdog_check_cpu_skew(struct clocksource *cs)
+static inline bool wd_csd_locked(struct watchdog_cpu_data *wd)
 {
-	unsigned int cpu = cpumask_next_wrap(watchdog_data.curr_cpu, cpu_online_mask);
-	struct watchdog_cpu_data *wd;
-
-	watchdog_data.curr_cpu = cpu;
-	/* Skip the current CPU. Handles num_online_cpus() == 1 as well */
-	if (cpu == smp_processor_id())
-		return;
+	return READ_ONCE(wd->csd.node.u_flags) & CSD_FLAG_LOCK;
+}
 
-	/* Don't interfere with the test mechanics */
-	if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
-		return;
+static void __watchdog_check_cpu_skew(struct clocksource *cs, unsigned int cpu)
+{
+	struct watchdog_cpu_data *wd;
 
 	wd = per_cpu_ptr(&watchdog_cpu_data, cpu);
-	if (atomic_read(&wd->remote_inprogress)) {
+	if (atomic_read(&wd->remote_inprogress) || wd_csd_locked(wd)) {
 		watchdog_data.result = WD_CPU_TIMEOUT;
 		return;
 	}
@@ -377,6 +377,29 @@ static void watchdog_check_cpu_skew(stru
 	}
 }
 
+static void watchdog_check_cpu_skew(struct clocksource *cs)
+{
+	unsigned int cpu = cpumask_next_wrap(watchdog_data.curr_cpu, cpu_online_mask);
+
+	watchdog_data.curr_cpu = cpu;
+	/* Skip the current CPU. Handles num_online_cpus() == 1 as well */
+	if (cpu == smp_processor_id())
+		return;
+
+	/* Don't interfere with the test mechanics */
+	if ((cs->flags & CLOCK_SOURCE_WDTEST) && !(cs->flags & CLOCK_SOURCE_WDTEST_PERCPU))
+		return;
+
+	for (int i = 0; i < WATCHDOG_REMOTE_RETRIES; i++) {
+		__watchdog_check_cpu_skew(cs, cpu);
+
+		if (watchdog_data.result != WD_CPU_TIMEOUT)
+			return;
+
+		udelay(WATCHDOG_READOUT_MAX_US);
+	}
+}
+
 static bool watchdog_check_freq(struct clocksource *cs, bool reset_pending)
 {
 	unsigned int ppm_shift = SHIFT_4000PPM;