Re: [PATCH] Clocksource: Avoid misjudgment of clocksource
From: Paul E. McKenney
Date: Fri Oct 22 2021 - 19:36:58 EST
On Thu, Oct 21, 2021 at 05:37:24AM -0400, Luming Yu wrote:
> On Wed, Oct 20, 2021 at 1:49 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Wed, Oct 20, 2021 at 06:09:58AM -0400, Luming Yu wrote:
> > > On Tue, Oct 19, 2021 at 1:04 AM John Stultz <john.stultz@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Oct 18, 2021 at 9:14 PM yanghui <yanghui.def@xxxxxxxxxxxxx> wrote:
> > > > > 在 2021/10/19 上午12:14, John Stultz 写道:
> > > > > > On Tue, Oct 12, 2021 at 1:06 AM brookxu <brookxu.cn@xxxxxxxxx> wrote:
> > > > > >> John Stultz wrote on 2021/10/12 13:29:
> > > > > >>> On Mon, Oct 11, 2021 at 10:23 PM brookxu <brookxu.cn@xxxxxxxxx> wrote:
> > > > > >>>> John Stultz wrote on 2021/10/12 12:52 下午:
> > > > > >>>>> On Sat, Oct 9, 2021 at 7:04 AM brookxu <brookxu.cn@xxxxxxxxx> wrote:
> > > > > >>>> If we record the watchdog's start_time in clocksource_start_watchdog(), and then
> > > > > >>>> when we verify cycles in clocksource_watchdog(), check whether the clocksource
> > > > > >>>> watchdog is blocked. Due to MSB verification, if the blocked time is greater than
> > > > > >>>> half of the watchdog timer max_cycles, then we can safely ignore the current
> > > > > >>>> verification? Do you think this idea is okay?
> > > > > >>>
> > > > > >>> I can't say I totally understand the idea. Maybe could you clarify with a patch?
> > > > > >>>
> > > > > >>
> > > > > >> Sorry, it looks almost as follows:
> > > > > >>
> > > > > >> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > > > >> index b8a14d2..87f3b67 100644
> > > > > >> --- a/kernel/time/clocksource.c
> > > > > >> +++ b/kernel/time/clocksource.c
> > > > > >> @@ -119,6 +119,7 @@
> > > > > >> static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> > > > > >> static DEFINE_SPINLOCK(watchdog_lock);
> > > > > >> static int watchdog_running;
> > > > > >> +static unsigned long watchdog_start_time;
> > > > > >> static atomic_t watchdog_reset_pending;
> > > > > >>
> > > > > >> static inline void clocksource_watchdog_lock(unsigned long *flags)
> > > > > >> @@ -356,6 +357,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > >> int next_cpu, reset_pending;
> > > > > >> int64_t wd_nsec, cs_nsec;
> > > > > >> struct clocksource *cs;
> > > > > >> + unsigned long max_jiffies;
> > > > > >> u32 md;
> > > > > >>
> > > > > >> spin_lock(&watchdog_lock);
> > > > > >> @@ -402,6 +404,10 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > > > >> if (atomic_read(&watchdog_reset_pending))
> > > > > >> continue;
> > > > > >>
> > > > > >> + max_jiffies = nsecs_to_jiffies(cs->max_idle_ns);
> > > > > >> + if (time_is_before_jiffies(watchdog_start_time + max_jiffies))
> > > > > >> + continue;
> > > > > >> +
> > > > > >
> > > > > > Sorry, what is the benefit of using jiffies here? Jiffies are
> > > > > > updated by counting the number of tick intervals on the current
> > > > > > clocksource.
> > > > > >
> > > > > > This seems like circular logic, where we're trying to judge the
> > > > > > current clocksource by using something we derived from the current
> > > > > > clocksource.
> > > > > > That's why the watchdog clocksource is important, as it's supposed to
> > > > > > be a separate counter that is more reliable (but likely slower) then
> > > > > > the preferred clocksource.
> > > > > >
> > > > > > So I'm not really sure how this helps.
> > > > > >
> > > > > > The earlier patch by yanghui at least used the watchdog interval to
> > > > > > decide if the watchdog timer had expired late. Which seemed
> > > > > > reasonable, but I thought it might be helpful to add some sort of a
> > > > > > counter so if the case is happening repeatedly (timers constantly
> > > > > > being delayed) we have a better signal that the watchdog and current
> > > > > > clocksource are out of sync. Because again, timers are fired based on
> > > > >
> > > > > I think only have a signal ls not enough. we need to prevent
> > > > > clocksource from being incorrectly switched.
> > > >
> > > > Right, but we also have to ensure that we also properly disqualify
> > > > clocksources that are misbehaving.
> > > >
> > > > In the case that the current clocksource is running very slow (imagine
> > > > old TSCs that lowered freq with cpufreq), then system time slows down,
> > > > so timers fire late.
> > > > So it would constantly seem like the irqs are being delayed, so with
> > > > your logic we would not disqualify a clearly malfunctioning
> > > > clocksource..
> > > >
> > > > > The Timer callback function clocksource_watchdog() is executed in the
> > > > > context of softirq(run_timer_softirq()). So if softirq is disabled for
> > > > > long time(One situation is long time softlockup), clocksource_watchdog()
> > > > > will be delay executed.
> > > >
> > > > Yes. The reality is that timers are often spuriously delayed. We don't
> > > > want a short burst of timer misbehavior to disqualify a good
> > > > clocksource.
> > > >
> > > > But the problem is that this situation and the one above (with the
> > > > freq changing TSC), will look exactly the same.
> > > >
> > > > So having a situation where if the watchdog clocksource thinks too
> > > > much time has passed between watchdog timers, we can skip judgement,
> > > > assuming its a spurious delay. But I think we need to keep a counter
> > > > so that if this happens 3-5 times in a row, we stop ignoring the
> > > > misbehavior and judge the current clocksource, as it may be running
> > > > slowly.
> > > >
> > > > > >
> > > > > I think it will be better to add this to my patch:
> > > > > /*
> > > > > * Interval: 0.5sec.
> > > > > - * MaxInterval: 1s.
> > > > > + * MaxInterval: 20s.
> > > > > */
> > > > > #define WATCHDOG_INTERVAL (HZ >> 1)
> > > > > -#define WATCHDOG_MAX_INTERVAL_NS (NSEC_PER_SEC)
> > > > > +#define WATCHDOG_MAX_INTERVAL_NS (20 * NSEC_PER_SEC)
> > > > >
> > > >
> > > > Some watchdog counters wrap within 20 seconds, so I don't think this
> > > > is a good idea.
> > > >
> > > > The other proposal to calculate the error rate, rather than a fixed
> > > > error boundary might be useful too, as if the current clocksource and
> > > > watchdog are close, a long timer delay won't disqualify them if we
> > > > scale the error bounds to be within an given error rate.
> > >
> > > In most of tsc unstable trouble shooting on modern servers we experienced,
> > > it usually ends up in a false alarm triggered by the clock source
> > > watchdog for tsc.
> > >
> > > I think Paul has a proposal to make a clock source watchdog to be more
> > > intelligent.
> > > Its job is to find a real problem instead of causing a problem.
> >
> > And that proposal is now in mainline:
>
> Great! : -)
> >
> > 22a223833716 clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
> > 1253b9b87e42 clocksource: Provide kernel module to test clocksource watchdog
> > 2e27e793e280 clocksource: Reduce clocksource-skew threshold
> > fa218f1cce6b clocksource: Limit number of CPUs checked for clock synchronization
> > 7560c02bdffb clocksource: Check per-CPU clock synchronization when marked unstable
> > db3a34e17433 clocksource: Retry clock read if long delays detected
> >
> > The strategy is to disqualify a clock comparison if the reads took too
> > long, and to retry the comparison. If four consecutive comparison take
> > too long, clock skew is reported. The number of consecutive comparisons
> > may be adjusted by the usual kernel boot parameter.
> >
> > > so disabling it for known good-tsc might be a reasonable good idea
> > > that can save manpower for other
> > > more valuable problems to solve, or at least make it statistically a
> > > problem less chance to happen.
> >
> > One additional piece that is still in prototype state in -rcu is to give
> > clocksources some opportunity to resynchronize if there are too many
> > slow comparisons. This is intended to handle cases where clocks often
>
> if there is such tsc-sync algorithm existing in software, it really
> can help system software engineers
> to solve some rare power good signals synchronization problem caused
> by bios that caused
> boot time tsc sync check failure that usually would consume huge
> debugging engine for bringing up qualified linux system.
>
> Less depending on platform quirks should be good thing to linux for
> tsc && rcu support.
Good point, I have procrastinated long enough.
How about like this?
Thanx, Paul
------------------------------------------------------------------------
commit 9ec2a03bbf4bee3d9fbc02a402dee36efafc5a2d
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date: Thu May 27 11:03:28 2021 -0700
clocksource: Forgive repeated long-latency watchdog clocksource reads
Currently, the clocksource watchdog reacts to repeated long-latency
clocksource reads by marking that clocksource unstable on the theory that
these long-latency reads are a sign of a serious problem. And this theory
does in fact have real-world support in the form of firmware issues [1].
However, it is also possible to trigger this using stress-ng on what
the stress-ng man page terms "poorly designed hardware" [2]. And it
is not necessarily a bad thing for the kernel to diagnose cases where
heavy memory-contention workloads are being run on hardware that is not
designed for this sort of use.
Nevertheless, it is quite possible that real-world use will result in
some situation requiring that high-stress workloads run on hardware
not designed to accommodate them, and also requiring that the kernel
refrain from marking clocksources unstable.
Therefore, react to persistent long-latency reads by leaving the
clocksource alone, but using the old 62.5-millisecond skew-detection
threshold. In addition, the offending clocksource is marked for
re-initialization, which both restarts that clocksource with a clean bill
of health and avoids false-positive skew reports on later watchdog checks.
Once marked for re-initialization, the clocksource is not subjected to
further watchdog checking until a subsequent successful read from that
clocksource that is free of excessive delays.
However, if clocksource.max_cswd_coarse_reads consecutive clocksource read
attempts result in long latencies, a warning (splat) will be emitted.
This kernel boot parameter defaults to 100, and this warning can be
disabled by setting it to zero or to a negative value.
[ paulmck: Apply feedback from Chao Gao ]
Link: https://lore.kernel.org/lkml/20210513155515.GB23902@xsang-OptiPlex-9020/ # [1]
Link: https://lore.kernel.org/lkml/20210521083322.GG25531@xsang-OptiPlex-9020/ # [2]
Link: https://lore.kernel.org/lkml/20210521084405.GH25531@xsang-OptiPlex-9020/
Link: https://lore.kernel.org/lkml/20210511233403.GA2896757@paulmck-ThinkPad-P17-Gen-1/
Tested-by: Chao Gao <chao.gao@xxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 316027c3aadc..61d2436ae9df 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -600,6 +600,14 @@
loops can be debugged more effectively on production
systems.
+ clocksource.max_cswd_coarse_reads= [KNL]
+ Number of consecutive clocksource_watchdog()
+ coarse reads (that is, clocksource reads that
+ were unduly delayed) that are permitted before
+ the kernel complains (gently). Set to a value
+ less than or equal to zero to suppress these
+ complaints.
+
clocksource.max_cswd_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 1d42d4b17327..3e925d9ffc31 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -110,6 +110,7 @@ struct clocksource {
int rating;
enum clocksource_ids id;
enum vdso_clock_mode vdso_clock_mode;
+ unsigned int n_coarse_reads;
unsigned long flags;
int (*enable)(struct clocksource *cs);
@@ -291,6 +292,7 @@ static inline void timer_probe(void) {}
#define TIMER_ACPI_DECLARE(name, table_id, fn) \
ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
+extern int max_cswd_coarse_reads;
extern ulong max_cswd_read_retries;
void clocksource_verify_percpu(struct clocksource *cs);
diff --git a/kernel/time/clocksource-wdtest.c b/kernel/time/clocksource-wdtest.c
index df922f49d171..7e82500c400b 100644
--- a/kernel/time/clocksource-wdtest.c
+++ b/kernel/time/clocksource-wdtest.c
@@ -145,13 +145,12 @@ static int wdtest_func(void *arg)
else if (i <= max_cswd_read_retries)
s = ", expect message";
else
- s = ", expect clock skew";
+ s = ", expect coarse-grained clock skew check and re-initialization";
pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
WRITE_ONCE(wdtest_ktime_read_ndelays, i);
schedule_timeout_uninterruptible(2 * HZ);
WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
- WARN_ON_ONCE((i <= max_cswd_read_retries) !=
- !(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
+ WARN_ON_ONCE(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE);
wdtest_ktime_clocksource_reset();
}
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index b8a14d2fb5ba..796a127aabb9 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -199,6 +199,9 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}
+int max_cswd_coarse_reads = 100;
+module_param(max_cswd_coarse_reads, int, 0644);
+EXPORT_SYMBOL_GPL(max_cswd_coarse_reads);
ulong max_cswd_read_retries = 3;
module_param(max_cswd_read_retries, ulong, 0644);
EXPORT_SYMBOL_GPL(max_cswd_read_retries);
@@ -226,13 +229,22 @@ static bool cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
smp_processor_id(), watchdog->name, nretries);
}
- return true;
+ cs->n_coarse_reads = 0;
+ return false;
}
+ WARN_ONCE(max_cswd_coarse_reads > 0 &&
+ !(++cs->n_coarse_reads % max_cswd_coarse_reads),
+ "timekeeping watchdog on CPU%d: %s %u consecutive coarse-grained reads\n", smp_processor_id(), watchdog->name, cs->n_coarse_reads);
}
- pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, marking unstable\n",
- smp_processor_id(), watchdog->name, wd_delay, nretries);
- return false;
+ if ((cs->flags & CLOCK_SOURCE_WATCHDOG) && !atomic_read(&watchdog_reset_pending)) {
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, coarse-grained skew check followed by re-initialization\n",
+ smp_processor_id(), watchdog->name, wd_delay, nretries);
+ } else {
+ pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d, awaiting re-initialization\n",
+ smp_processor_id(), watchdog->name, wd_delay, nretries);
+ }
+ return true;
}
static u64 csnow_mid;
@@ -356,6 +368,7 @@ static void clocksource_watchdog(struct timer_list *unused)
int next_cpu, reset_pending;
int64_t wd_nsec, cs_nsec;
struct clocksource *cs;
+ bool coarse;
u32 md;
spin_lock(&watchdog_lock);
@@ -373,16 +386,13 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
}
- if (!cs_watchdog_read(cs, &csnow, &wdnow)) {
- /* Clock readout unreliable, so give it up. */
- __clocksource_unstable(cs);
- continue;
- }
+ coarse = cs_watchdog_read(cs, &csnow, &wdnow);
/* Clocksource initialized ? */
if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
atomic_read(&watchdog_reset_pending)) {
- cs->flags |= CLOCK_SOURCE_WATCHDOG;
+ if (!coarse)
+ cs->flags |= CLOCK_SOURCE_WATCHDOG;
cs->wd_last = wdnow;
cs->cs_last = csnow;
continue;
@@ -403,7 +413,13 @@ static void clocksource_watchdog(struct timer_list *unused)
continue;
/* Check the deviation from the watchdog clocksource. */
- md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+ if (coarse) {
+ md = 62500 * NSEC_PER_USEC;
+ cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+ pr_warn("timekeeping watchdog on CPU%d: %s coarse-grained %lu.%03lu ms clock-skew check followed by re-initialization\n", smp_processor_id(), watchdog->name, md / NSEC_PER_MSEC, md % NSEC_PER_MSEC / NSEC_PER_USEC);
+ } else {
+ md = cs->uncertainty_margin + watchdog->uncertainty_margin;
+ }
if (abs(cs_nsec - wd_nsec) > md) {
pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
smp_processor_id(), cs->name);