Re: [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog

From: Luming Yu
Date: Wed Apr 28 2021 - 10:25:58 EST


On Wed, Apr 28, 2021 at 9:57 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Wed, Apr 28, 2021 at 12:49:12PM +0800, Luming Yu wrote:
> > We 'd expect to see clock_source watchdog can avoid to do wrong thing
> > due to the injected delay or
> > in real life delay by doing tsc sync-re-check by applying the patch-set.
> > However , the noise is still cause wrong actions and the patch doesn't
> > defeat the injected's delay
> > please correct me if I'm wrong.
>
> Injecting delay is just a test. In real life, if you got four delays
> in a row, the cause is likely that the clock read is broken and taking
> a very long time. In which case marking that clock unstable is a
> reasonable response.
>
> Other causes include having an NMI or SMI storm, getting extremely
> unlucky with vCPU preemptions, and so on. In these cases, you are not
> making much forward progress anyway, so marked-unstable clock is the
> least of your worries.
>
> I ran this (without injected delays) for more than a thousand hours on
> rcutorture guest OSes and didn't see any instances of even two consecutive
> bad reads. There was the very occasional single instance of a bad read.
>
> Therefore, the code marks the clock unstable if it has four bad reads
> in a row, as it should.

The hard problem to solve is tsc is still in good shape and it can be verified
with a quick cross check with other thread/core's tsc counts in the
injected situation or in real life case
to prove if it is truly a tsc problem or reference clock's problem of
the watchdog.

Ideally, we could factor out hard-to-debug unstable tsc problems from
clock source watchdog problems
and get less and less tsc sightings caused by clock source watchdog.

>
> Thanx, Paul
>
> > parameters]# cat *
> > 1
> > 1
> > -1
> > 3
> > 8
> >
> > [62939.809615] clocksource: clocksource_watchdog_inject_delay():
> > Injecting delay.
> > [62939.816867] clocksource: clocksource_watchdog_inject_delay():
> > Injecting delay.
> > [62939.824094] clocksource: clocksource_watchdog_inject_delay():
> > Injecting delay.
> > [62939.831314] clocksource: clocksource_watchdog_inject_delay():
> > Injecting delay.
> > [62939.838536] clocksource: timekeeping watchdog on CPU26: hpet
> > read-back delay of 7220833ns, attempt 4, marking unstable
> > [62939.849230] tsc: Marking TSC unstable due to clocksource watchdog
> > [62939.855340] TSC found unstable after boot, most likely due to
> > broken BIOS. Use 'tsc=unstable'.
> > [62939.863972] sched_clock: Marking unstable (62943398530130,
> > -3543150114)<-(62941186607503, -1331276112)
> > [62939.875104] clocksource: Checking clocksource tsc synchronization
> > from CPU 123 to CPUs 0,6,26,62,78,97-98,137.
> > [62939.886518] clocksource: Switched to clocksource hpet
> >
> > On Tue, Apr 27, 2021 at 2:27 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Apr 26, 2021 at 10:56:27AM -0700, Andi Kleen wrote:
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > - module parameters
> > > > >
> > > > > If the scope of the fault injection capability is limited to a
> > > > > single kernel module, it is better to provide module parameters to
> > > > > configure the fault attributes.
> > > > >
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > And in this case, the fault injection capability is in fact limited to
> > > > > kernel/clocksource.c.
> > > >
> > > >
> > > > I disagree with this recommendation because it prevents fuzzer coverage.
> > > >
> > > > Much better to have an uniform interface that can be automatically
> > > > explored.
> > >
> > > The permissions for these module parameters is 0644, so there is no
> > > reason why the fuzzers cannot use them via sysfs.
> > >
> > > Thanx, Paul