Re: [PATCH v2 0/4] clocksource: Avoid incorrect hpet fallback

From: Paul E. McKenney
Date: Wed Nov 17 2021 - 11:54:30 EST

On Tue, Nov 16, 2021 at 06:44:22PM -0500, Waiman Long wrote:
> It was found that when an x86 system was being stressed by running
> various different benchmark suites, the clocksource watchdog might
> occasionally mark TSC as unstable and fall back to hpet, which has
> a significant impact on system performance.
> The current watchdog clocksource skew threshold of 50us is found to be
> insufficient. So patch 1 changes it back to 100us, the value used before
> commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"). This
> patch also skips the current clock skew check if the consecutive watchdog
> read-back delay contributes a major portion of the total delay. On a
> 1-socket 64-thread test system, it was actually found that in one of the
> test samples, the hpet-tsc-hpet delay was 95263ns, while the corresponding
> hpet-hpet delay was 94425ns. So the majority of the delay is caused by
> the hpet read.
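
The skip condition described in the paragraph above can be sketched in
plain C as follows. This is only a userspace illustration, not the code
from patch 1: the helper name is hypothetical, and reading "a major
portion" as "more than half of the total delay" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch.
 *
 * wd_cs_wd_ns: delay of a watchdog -> clocksource -> watchdog read sequence
 * wd_wd_ns:    delay of two back-to-back watchdog reads
 *
 * ASSUMPTION: "major portion" is taken to mean more than half of the
 * total delay. The real threshold may differ.
 */
static bool skew_check_should_skip(uint64_t wd_cs_wd_ns, uint64_t wd_wd_ns)
{
	return wd_wd_ns > wd_cs_wd_ns / 2;
}
```

With the sample quoted above, 94425ns of the 95263ns total (about 99%)
is spent reading hpet, so under this assumption the check is skipped.
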
> Patch 2 reduces the default clocksource_watchdog() retries to 2 as
> suggested by Paul.
> Patch 3 implements dynamic readjustment of the new internal
> watchdog_max_skew variable in case the current value causes excessive
> skipping of clock skew checks. The following reproducer provided by
> Feng Tang was used to cause the test skipping:
>
>   sudo stress-ng --timeout 30 --times --verify --metrics-brief --ioport <n>
>
> where <n> is the number of cpus in the system.
>
> A sample watchdog_max_skew readjustment output was:
>
> [ 197.771144] clocksource: timekeeping watchdog on CPU8: hpet wd-wd read-back delay of 92539ns
> [ 197.789589] clocksource: wd-tsc-wd read-back delay of 90933ns, clock-skew test skipped!
> [ 197.807145] clocksource: timekeeping watchdog on CPU8: watchdog_max_skew increased to 185078ns
>
> To avoid excessive increase of watchdog_max_skew, a limit of
> 10*WATCHDOG_MAX_SKEW is used, above which the watchdog itself will be
> marked unstable and a new watchdog will be selected if possible.
> To exercise the code, WATCHDOG_MAX_SKEW was reduced to 10us. After
> skipping 10 checks, the watchdog then fell back to acpi_pm. However,
> the corresponding consecutive watchdog delay was still about the same,
> leading to ping-ponging between hpet and acpi_pm as the watchdog.
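
The readjustment described for patch 3 can be sketched roughly as
below. This is a userspace illustration under stated assumptions, not
the patch itself. The helper and enum names are hypothetical. The
"twice the wd-wd delay" rule is only inferred from the sample log
(2 * 92539ns = 185078ns). The 10x cap is the limit described above.

```c
#include <stdint.h>

enum wd_adjust_result { WD_SKEW_ADJUSTED, WD_MARK_UNSTABLE };

/*
 * Hypothetical helper, not taken from the patch.
 *
 * max_skew_ns:    current watchdog_max_skew value, updated in place
 * wd_wd_delay_ns: observed delay of two back-to-back watchdog reads
 * base_skew_ns:   the compile-time WATCHDOG_MAX_SKEW value
 *
 * ASSUMPTION: the new threshold is twice the observed wd-wd delay,
 * which matches the sample log but is not confirmed by the text.
 */
static enum wd_adjust_result
watchdog_max_skew_adjust(uint64_t *max_skew_ns, uint64_t wd_wd_delay_ns,
			 uint64_t base_skew_ns)
{
	uint64_t new_skew = 2 * wd_wd_delay_ns;

	/* Past the 10x cap, give up on this watchdog. */
	if (new_skew > 10 * base_skew_ns)
		return WD_MARK_UNSTABLE;

	/* Only ever raise the threshold, never lower it. */
	if (new_skew > *max_skew_ns)
		*max_skew_ns = new_skew;

	return WD_SKEW_ADJUSTED;
}
```

With a 100us base, a 92539ns wd-wd delay raises the threshold to
185078ns, as in the log. With the base reduced to 10us, the same delay
exceeds the 100us cap and the watchdog would be marked unstable, which
is consistent with the fallback to acpi_pm described above.
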
> Patch 4 adds a Kconfig option to allow the kernel builder to control the
> actual WATCHDOG_MAX_SKEW threshold to be used.

A few questions:

1. Once you have all the patches in place, is the increase in
WATCHDOG_MAX_SKEW from 50us to 100us necessary?

2. The reason for having cs->uncertainty_margin set to
2*WATCHDOG_MAX_SKEW was to allow for worst-case skew from both
the previous and the current reading. Are you sure that
dropping back to WATCHDOG_MAX_SKEW avoids false positives?

3. In patch 3/4, shouldn't clock_skew_skip be a field in the
clocksource structure rather than a global? If a system had
multiple clocks being checked, wouldn't having this as a field
make things more predictable? Or am I missing something subtle?

4. These are intended to replace this commit in -rcu, correct?

9d5739316f36 ("clocksource: Forgive repeated long-latency watchdog clocksource reads")

But not this commit, correct?

5444fb39fd49 ("torture: Test splatting for delay-ridden clocksources")

And would you like me to queue these, or would you rather send them
separately? (Either way works for me, just please let me know.)
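
To make question 3 above concrete, here is a rough userspace sketch of
what a per-clocksource skip counter might look like. The structure is
a simplified stand-in for the kernel's struct clocksource, and the
helper name is hypothetical. Only the clock_skew_skip name comes from
patch 3/4.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel's struct clocksource. */
struct clocksource_sketch {
	const char *name;
	uint32_t uncertainty_margin;	/* ns */
	unsigned int clock_skew_skip;	/* hypothetical per-clocksource field */
};

/*
 * Hypothetical helper: record one skipped skew check against the
 * clocksource that was being checked and return its running count.
 */
static unsigned int note_skipped_check(struct clocksource_sketch *cs)
{
	return ++cs->clock_skew_skip;
}
```

With a field like this, skipped checks on one clocksource would not
count against another one that is being checked by the same watchdog.
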

Thanx, Paul

> Waiman Long (4):
> clocksource: Avoid accidental unstable marking of clocksources
> clocksource: Reduce the default clocksource_watchdog() retries to 2
> clocksource: Dynamically increase watchdog_max_skew
> clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
> .../admin-guide/kernel-parameters.txt | 4 +-
> kernel/time/Kconfig | 9 ++
> kernel/time/clocksource.c | 121 +++++++++++++++---
> 3 files changed, 114 insertions(+), 20 deletions(-)
> --
> 2.27.0