Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically
From: David Woodhouse
Date: Wed Oct 04 2023 - 06:01:38 EST
On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
> On Tue, Oct 03, 2023, David Woodhouse wrote:
> > On Mon, 2023-10-02 at 17:53 -0700, Sean Christopherson wrote:
> > >
> > > The two domains use the same "clock" (constant TSC), but different math to compute
> > > nanoseconds from a given TSC value. For decently large TSC values, this results
> > > in CLOCK_MONOTONIC_RAW and kvmclock computing two different times in nanoseconds.
> >
> > This is the bit I'm still confused about, and it seems to be the root
> > of all the other problems.
> >
> > Both CLOCK_MONOTONIC_RAW and kvmclock have *one* job: to convert a
> > number of ticks of the TSC running at a constant known frequency, to a
> > number of nanoseconds.
> >
> > So how in the name of all that is holy do they manage to get
> > *different* answers?
> >
> > I get that the mult/shift thing carries some imprecision, but is that
> > all it is?
>
> Yep, pretty sure that's it. It's like the plot from Office Space / Superman III.
> Those little rounding errors add up over time.
>
> PV clock:
>
> nanoseconds = ((TSC >> shift) * mult) >> 32
>
> or
>
> nanoseconds = ((TSC << shift) * mult) >> 32
>
> versus timekeeping (mostly)
>
> nanoseconds = (TSC * mult) >> shift
>
> The more I look at the PV clock stuff, the more I agree with Peter: it's garbage.
> Shifting before multiplying is guaranteed to introduce error. Shifting right drops
> data, and shifting left introduces zeros.
>
> > Can't we ensure that the kvmclock uses the *same* algorithm,
> > precisely, as CLOCK_MONOTONIC_RAW?
>
> Yes? At least for sane hardware, after much staring, I think it's possible.
>
> It's tricky because the two algorithms are weirdly different, the PV clock algorithm
> is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
> at us for suggesting that we try to shove the pvclock algorithm into the kernel.
>
> The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
>
> Compile tested only, but I believe this math is correct. And I'm guessing we'd
> want some safeguards against overflow, e.g. due to a multiplier that is too big.
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6573c89c35a9..ae9275c3d580 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  					    v->arch.l1_tsc_scaling_ratio);
>  
>  	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> -		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> -				   &vcpu->hv_clock.tsc_shift,
> -				   &vcpu->hv_clock.tsc_to_system_mul);
> +		u32 shift, mult;
> +
> +		clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
> +
> +		if (shift <= 32) {
> +			vcpu->hv_clock.tsc_shift = 0;
> +			vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
> +		} else {
> +			kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> +					   &vcpu->hv_clock.tsc_shift,
> +					   &vcpu->hv_clock.tsc_to_system_mul);
> +		}
> +
>  		vcpu->hw_tsc_khz = tgt_tsc_khz;
>  		kvm_xen_update_tsc_info(v);
>  	}
>
I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
it got mult=1655736523, shift=32 and took the 'happy' path instead of
falling back.
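
As a rough userspace illustration of why shift=32 lands on the 'happy' path: with tsc_shift = 0 the pvclock multiplier is just mult, and the two conversion formulas quoted above collapse into the same expression. This is only a sketch; the helper names (clocksource_ns, pvclock_ns, clocksource_to_pvclock) are hypothetical and not from the patch, and it assumes a compiler that provides unsigned __int128 (GCC or Clang). The overflow check is one possible form of the safeguard mentioned above, not what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Timekeeping-style conversion: ns = (tsc * mult) >> shift. */
static uint64_t clocksource_ns(uint64_t tsc, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);
	return (uint64_t)(((unsigned __int128)tsc * mult) >> shift);
}

/* pvclock-style conversion: pre-shift the TSC, multiply, then >> 32. */
static uint64_t pvclock_ns(uint64_t tsc, uint32_t mul, int8_t tsc_shift)
{
	if (tsc_shift >= 0)
		tsc <<= tsc_shift;
	else
		tsc >>= -tsc_shift;
	return (uint64_t)(((unsigned __int128)tsc * mul) >> 32);
}

/*
 * Hypothetical helper mirroring the "shift <= 32" branch: express a
 * clocksource (mult, shift) pair as a pvclock multiplier with tsc_shift == 0.
 * Returns false if shift > 32 or if the scaled multiplier would not fit in
 * 32 bits (the unguarded u32 multiply would silently wrap in that case).
 */
static bool clocksource_to_pvclock(uint32_t mult, uint32_t shift,
				   uint32_t *pv_mul)
{
	uint64_t scaled;

	if (shift > 32)
		return false;

	scaled = (uint64_t)mult << (32 - shift);
	if (scaled > UINT32_MAX)
		return false;

	*pv_mul = (uint32_t)scaled;
	return true;
}
```

For mult=1655736523, shift=32 the helper returns the multiplier unchanged, so pvclock_ns(tsc, mult, 0) and clocksource_ns(tsc, mult, 32) are the same computation. That only says the formula shape matches for this pair; it does not establish that CLOCK_MONOTONIC_RAW is using the same (mult, shift) values.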

It still drifts about the same though, using the same test as before:
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock

I was going to facetiously suggest that perhaps the kvmclock should
have leap nanoseconds... but then realised that that's basically what
Dongli's patch is *doing*. Maybe we just need to *recognise* that, so
rather than having a user-configured period for the update, KVM could
calculate the frequency for the updates based on the rate at which the
clocks would otherwise drift, and a maximum delta? Not my favourite
option, but perhaps better than nothing?
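
To make the "calculate the frequency for the updates" idea concrete, here is a minimal userspace sketch, not a proposal for the actual KVM code. It assumes both clocks can be described by a (mult, shift) pair, estimates the drift from one second's worth of ticks, and extrapolates linearly. All names (scale_ns, drift_ns_per_sec, update_period_sec) and the max_delta_ns parameter are hypothetical, and unsigned __int128 is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* ns = (ticks * mult) >> shift, using a 128-bit intermediate product. */
static uint64_t scale_ns(uint64_t ticks, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);
	return (uint64_t)(((unsigned __int128)ticks * mult) >> shift);
}

/* Absolute difference, in ns, between two conversions of one second of ticks. */
static uint64_t drift_ns_per_sec(uint64_t tsc_hz,
				 uint32_t mult_a, uint32_t shift_a,
				 uint32_t mult_b, uint32_t shift_b)
{
	uint64_t a = scale_ns(tsc_hz, mult_a, shift_a);
	uint64_t b = scale_ns(tsc_hz, mult_b, shift_b);

	return a > b ? a - b : b - a;
}

/*
 * Whole seconds between updates so the accumulated drift stays at or below
 * max_delta_ns. Returns UINT64_MAX when no drift was measured, and clamps
 * to one second when the drift already exceeds the budget within a second.
 */
static uint64_t update_period_sec(uint64_t drift_per_sec, uint64_t max_delta_ns)
{
	uint64_t period;

	if (drift_per_sec == 0)
		return UINT64_MAX;

	period = max_delta_ns / drift_per_sec;
	return period ? period : 1;
}
```

With made-up inputs of 59 ns of drift per second and a 1000 ns budget, update_period_sec() returns 16, i.e. an update every 16 seconds. Real values would have to come from the multipliers the host timekeeping and the kvmclock are actually using.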