Re: [tip: sched/core] sched/clock/x86: Mark sched_clock() noinstr

From: Sean Christopherson
Date: Wed Jan 15 2025 - 18:22:35 EST


+KVM and Paolo

On Tue, Jan 31, 2023, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: 8739c6811572b087decd561f96382087402cc343
> Gitweb: https://git.kernel.org/tip/8739c6811572b087decd561f96382087402cc343
> Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> AuthorDate: Thu, 26 Jan 2023 16:08:36 +01:00
> Committer: Ingo Molnar <mingo@xxxxxxxxxx>
> CommitterDate: Tue, 31 Jan 2023 15:01:47 +01:00
>
> sched/clock/x86: Mark sched_clock() noinstr
>
> In order to use sched_clock() from noinstr code, mark it and all its
> implementations noinstr.
>
> The whole pvclock thing (used by KVM/Xen) is a bit of a pain, since it
> calls out to the watchdogs; create a pvclock_clocksource_read_nowd()
> variant that doesn't do that and can be noinstr.

...

> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 16333ba..0f35d44 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -71,12 +71,12 @@ static int kvm_set_wallclock(const struct timespec64 *now)
> return -ENODEV;
> }
>
> -static u64 kvm_clock_read(void)
> +static noinstr u64 kvm_clock_read(void)
> {
> u64 ret;
>
> preempt_disable_notrace();
> - ret = pvclock_clocksource_read(this_cpu_pvti());
> + ret = pvclock_clocksource_read_nowd(this_cpu_pvti());
> preempt_enable_notrace();
> return ret;
> }
> @@ -86,7 +86,7 @@ static u64 kvm_clock_get_cycles(struct clocksource *cs)
> return kvm_clock_read();
> }
>
> -static u64 kvm_sched_clock_read(void)
> +static noinstr u64 kvm_sched_clock_read(void)
> {
> return kvm_clock_read() - kvm_sched_clock_offset;
> }
> diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
> index 5a2a517..56acf53 100644
> --- a/arch/x86/kernel/pvclock.c
> +++ b/arch/x86/kernel/pvclock.c
> @@ -64,7 +64,8 @@ u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
> return flags & valid_flags;
> }
>
> -u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> +static __always_inline
> +u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
> {
> unsigned version;
> u64 ret;
> @@ -77,7 +78,7 @@ u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> flags = src->flags;
> } while (pvclock_read_retry(src, version));
>
> - if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> + if (dowd && unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {

This effectively broke sched_clock handling of PVCLOCK_GUEST_STOPPED, which was
added by commit d63285e94af3 ("pvclock: detect watchdog reset at pvclock read").

By skipping the watchdog goo in the kvm_clock_read() path, the watchdogs will only
be petted when kvmclock's pvclock_read_wallclock() is invoked, which AFAICT is
limited to guest boot and suspend+resume (in the guest). I.e. PVCLOCK_GUEST_STOPPED
is ignored until the *guest* suspends, which completely defeats the purpose of
petting the watchdogs when reading the clock.
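
For reference, the handling that the _nowd variant skips on the sched_clock
path is roughly this (paraphrased from pvclock.c, not the exact code):

	if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
		/* Host stopped the vCPU; clear the flag and pet every watchdog. */
		src->flags &= ~PVCLOCK_GUEST_STOPPED;
		pvclock_touch_watchdogs();
	}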

I'm guessing no one has noticed/complained because watchdog_timer_fn(),
wq_watchdog_timer_fn(), and check_cpu_stall() all manually handle a STOPPED
vCPU by invoking kvm_check_and_clear_guest_paused(). At a glance, only the
hung task detector isn't in on the game, but its doggie still gets petted so
long as one of the other watchdogs runs first.
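
The manual handling in question is roughly this pattern in each of those
paths (paraphrased, e.g. from watchdog_timer_fn()):

	/* A host-side pause can look like a lockup; check before warning. */
	if (kvm_check_and_clear_guest_paused())
		return HRTIMER_RESTART;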

One option would be to sprinkle noinstr everywhere, because despite
pvclock_touch_watchdogs() calling a lot of functions, all of those functions
are little more than one-liners, i.e. they can be made noinstr without
significant downsides.
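
E.g. something like the below (untested sketch; each callee is itself a
trivial helper that would need the same treatment):

	void noinstr pvclock_touch_watchdogs(void)
	{
		touch_softlockup_watchdog_sync();
		clocksource_touch_watchdog();
		rcu_cpu_stall_reset();
		reset_hung_task_detector();
	}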

Alternatively, we could explicitly revert commit d63285e94af3, and then bribe
someone into finding a better/common place to handle PVCLOCK_GUEST_STOPPED.
I am leaning heavily toward this alternative, because the way things are headed
with kvmclock is that the kernel will stop using it for sched_clock[*], at which
point VMs that run on modern setups won't get the sched_clock behavior anyway.

[*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@xxxxxxxxxx