Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup

From: David Woodhouse
Date: Mon Apr 08 2024 - 23:51:13 EST

Next message: Jason Wang: "Re: [PATCH v3] vp_vdpa: fix the method of calculating vectors"
Previous message: Xu Yilun: "Re: [PATCH v2] fpga: dfl: fme: revise kernel-doc comments for some functions"
In reply to: Dongli Zhang: "Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup"
Next in thread: Allister, Jack: "Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, 2024-04-08 at 17:34 -0700, Dongli Zhang wrote:
> Hi Jack,
>
> On 4/8/24 15:07, Jack Allister wrote:
> > There is a potential for drift between the TSC and a KVM/PV clock when the
> > guest TSC is scaled (as seen previously in [1]). Which fixed drift between
> > timers over the lifetime of a VM.
>
> Those patches mentioned "TSC scaling" mutiple times. Is it a necessary to
> reproduce this issue? I do not think it is necessary. The tsc scaling may speed
> up the drift, but not the root cause.

Right. See the three definitions of the KVM clock I described in
https://lore.kernel.org/kvm/7e0040f70c629d365e80d13b339a95e0affa6d61.camel@xxxxxxxxxxxxx/

One is based on the host CLOCK_MONOTONIC_RAW + ka->kvmclock_offset, the
second is based on the reference point in master_kernel_ns /
master_cycle_now, and runs at the rate of the (scaled) guest TSC.

It's the *third* definition, which I called 'Definition C', which is
purely a bug, and which comes from running at the rate of the
*unscaled* guest TSC. And which doesn't exist without TSC scaling.

I don't think this patch needs to talk about the TSC scaling issues at
all. Those, and the existence of 'Definition C', are just bugs.

For that matter, I'm not entirely sure I'd have mentioned 'drift'
either. Yes, the two non-buggy definitions do drift from each other but
that isn't the point of this patch. Let's try again to explain what
*is* the point of this patch...

In all sane environments, ka->use_master_clock is set and 'Definition
B' of the KVM clock is used. The KVM clock is a simple arithmetic
function of the guest TSC.

The existing KVM_GET_CLOCK function isn't just buggy because it uses
'Definition C' and ignores TSC scaling. It's also just fundamentally
wrong as an API.

First let's consider the case of live *update*, where we serialize the
guest and kexec into a new kernel, then resum the guest. So we don't
have to deal with the full horrors of TSC migration, and the broken
algorithm described in the kernel documentation for doing that; we can
just preserve the KVM_VCPU_TSC_OFFSET. (We'll come to that problem
later).

Now, how do we restore the KVM clock? There is no real excuse for this
not being cycle-accurate, and giving the *same* pvclock information
back to the guest as before. The TSC wasn't disrupted at all during the
steal time. And the KVM clock should be precisely the *same* arithmetic
function of the TSC. The pvclock structure shouldn't change by a single
bit! Although in fact we *will* tolerate the data structure changing,
but only if it gives the same *result* (±1ns for rounding as Jack
notes).

KVM_[GS]ET_CLOCK are fundamentally useless for this. KVM_SET_CLOCK can
attempt to set the KVM clock at a given UTC time, but UTC is an awful
reference. Its frequency is also skewed by NTP, which is probably going
to be unsynchronized on the newly-booted kernel and running at a
different rate w.r.t the TSC than it was on the source.

Using UTC was always stupid because of leap seconds; TAI would have
been better. And one might argue that that aside, KVM_[GS]ET_CLOCK
isn't *such* an awful thing to use for live *migration*.

But wait, we *also* have to migrate the TSC using a wallclock time as a
base, and that naturally introduces some unavoidable discontinuity in
the TSC. So why in $DEITY's name would we use wallclock time as the
reference for migrating the KVM clock, giving it a *different* time
jump from the TSC on which it is based?

Even for live migration, even though the accuracy of the guest clocks
are fundamentally limited by the accuracy of the NTP sync between the
source and destination hosts, why wouldn't the clock jump at least be
*consistent*? We want that precise arithmetic relationship from TSC to
KVM clock to be preserved even in the live *migration* case where we
have to accept that the TSC itself isn't cycle-accurate.

Because we don't have a way to accurately preserve the TSC→KVM clock
relationship, we are seeing the clocksource watchdog trigger in guests
and disable the TSC clocksource... because the *KVM clock* was used as
a reference and got corrupted :)

So, Jack's patch adds KVM_[GS]ET_CLOCK_GUEST which reads and writes the
actual struct pvclock_vcpu_time_info structure.

> How about to cite the below patch as the beginning. The below patch only
> *avoids* KVM_REQ_MASTERCLOCK_UPDATE in some situations, but never solve the
> problem when KVM_REQ_MASTERCLOCK_UPDATE is triggered ... therefore we need this
> patchset ...
>
> KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c52ffadc65e28ab461fd055e9991e8d8106a0056
>
> I think this patch is only closely related to KVM_REQ_MASTERCLOCK_UPDATE, not
> TSC scaling.
>
> ...
>
> Same as above. I think the key is to explain the issue when
> KVM_REQ_MASTERCLOCK_UPDATE is triggered, not to emphasize the TSC scaling.
> Please correct me if I am wrong.

Not sure; I think that's a separate issue too.

> >
> > Additional interfaces are added to retrieve & fixup the PV time information
> > when a VMM may believe is appropriate (deserialization after live-update/
> > migration). KVM_GET_CLOCK_GUEST can be used for the VMM to retrieve the
> > currently used PV time information and then when the VMM believes a drift
> > may occur can then instruct KVM to perform a correction via the setter
> > KVM_SET_CLOCK_GUEST.
> >
> > The KVM_SET_CLOCK_GUEST ioctl works under the following premise. The host
> > TSC & kernel timstamp are sampled at a singular point in time. Using the
>
> Typo: "timstamp"
>
> > already known scaling/offset for L1 the guest TSC is then derived from this
>
> I assume you meant to derive guest TSC from TSC offset/scaling, not to derive
> kvmclock. What does "TSC & kernel timstamp" mean?

In ka->use_master_clock mode (Definition B), the KVM clock is defined
in terms of guest TSC ticks since a reference point in time, stored in
ka->master_cycle_now and ka->master_kernel_ns. That is, a TSC, and a
kernel (CLOCK_MONOTONIC_RAW, to be precise) timestamp.

Jack's KVM_SET_CLOCK_GUEST should be taking a new reference point with
kvm_get_time_and_clockread(). As we know, changing the reference point
like that has basically the same effect as a stray invocation of
KVM_REQ_MASTERCLOCK_UPDATE — it yanks the KVM clock back to 'Definition
A' based on the CLOCK_MONOTONIC_RAW. But we correct for that...

Next we calculate (correctly via guest TSC frequency) the KVM clock
value which would result with the newly changed reference point.

Then we calculate the *intended* KVM clock at that *same* host TSC
value, by converting to a guest TSC value and running it through the
pvclock information passed into the ioctl.

Then adjust ka->kvmclock_offset by the delta between the two.

And now the KVM clock at this moment is set to *precisely* what it
would was before the live {update,migration}, in relation to the guest
TSC. At least within 1ns.

> > information.
> >
> > From here two PV time information structures are created, one which is the
> > original time information structure prior to whatever may have caused a PV
> > clock re-calculation (live-update/migration). The second is then using the
> > singular point in time sampled just prior. An individual KVM/PV clock for
> > each of the PV time information structures using the singular guest TSC.
> >
> > A delta is then determined between the two calculated PV times, which is
> > then used as a correction offset added onto the kvmclock_offset for the VM.
> >
> > [1]: https://urldefense.com/v3/__https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=451a707813ae__;!!ACWV5N9M2RV99hQ!OnMXeXj4Plz6xvAc5lYsKaR3d1GDGGGRhZkdLMbxr8Skc_VAv_O1H8qP9igQv4KPCtYDw2ShTUtEd2o3mD5R$ ;
> >
> > Suggested-by: David Woodhouse <dwmw2@xxxxxxxxxxxxx>
> > Signed-off-by: Jack Allister <jalliste@xxxxxxxxxx>
> > CC: Paul Durrant <paul@xxxxxxx>
> > ---
> > Documentation/virt/kvm/api.rst | 43 +++++++++++++++++
> > arch/x86/kvm/x86.c             | 87 ++++++++++++++++++++++++++++++++++
> > include/uapi/linux/kvm.h       | 3 ++
> > 3 files changed, 133 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 0b5a33ee71ee..5f74d8ac1002 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6352,6 +6352,49 @@ a single guest_memfd file, but the bound ranges must not overlap).
> >
> > See KVM_SET_USER_MEMORY_REGION2 for additional details.
> >
> > +4.143 KVM_GET_CLOCK_GUEST
> > +----------------------------
> > +
> > +:Capability: none
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct pvclock_vcpu_time_info (out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Retrieves the current time information structure used for KVM/PV clocks.
> > +On x86 a PV clock is derived from the current TSC and is then scaled based
> > +upon the a specified multiplier and shift. The result of this is then added
> > +to a system time.
>
> Typo: "the a".
>
> > +
> > +The guest needs a way to determine the system time, multiplier and shift. This
> > +can be done by multiple ways, for KVM guests this can be via an MSR write to
> > +MSR_KVM_SYSTEM_TIME / MSR_KVM_SYSTEM_TIME_NEW which defines the guest physical
> > +address KVM shall put the structure. On Xen guests this can be found in the Xen
> > +vcpu_info.
> > +
> > +This is structure is useful information for a VMM to also know when taking into
> > +account potential timer drift on live-update/migration.
> > +
> > +4.144 KVM_SET_CLOCK_GUEST
> > +----------------------------
> > +
> > +:Capability: none
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct pvclock_vcpu_time_info (in)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Triggers KVM to perform a correction of the KVM/PV clock structure based upon a
> > +known prior PV clock structure (see KVM_GET_CLOCK_GUEST).
> > +
> > +If a VM is utilizing TSC scaling there is a potential for a drift between the
> > +KVM/PV clock and the TSC itself. This is due to the loss of precision when
> > +performing a multiply and shift rather than divide for the TSC.
> > +
> > +To perform the correction a delta is calculated between the original time info
> > +(which is assumed correct) at a singular point in time X. The KVM clock offset
> > +is then offset by this delta.
> > +
> > 5. The kvm_run structure
> > ========================
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 47d9f03b7778..5d2e10cd1c30 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6988,6 +6988,87 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> >         return 0;
> > }
> >
> > +static struct kvm_vcpu *kvm_get_bsp_vcpu(struct kvm *kvm)
> > +{
> > +       struct kvm_vcpu *vcpu = NULL;
> > +       int i;
> > +
> > +       for (i = 0; i < KVM_MAX_VCPUS; i++) {
> > +               vcpu = kvm_get_vcpu_by_id(kvm, i);
> > +               if (!vcpu)
> > +                       continue;
> > +
> > +               if (kvm_vcpu_is_reset_bsp(vcpu))
> > +                       break;
> > +       }
> > +
> > +       return vcpu;
> > +}
>
> Would the above rely not only on TSC clocksource, but also
> ka->use_master_clock==true?
>
> 3125         ka->use_master_clock = host_tsc_clocksource && vcpus_matched
> 3126                                 && !ka->backwards_tsc_observed
> 3127                                 && !ka->boot_vcpu_runs_old_kvmclock;
>
> Should the condition of (ka->use_master_clock==true) be checked in the ioctl?
>
> > +
> > +static int kvm_vm_ioctl_get_clock_guest(struct kvm *kvm, void __user *argp)
> > +{
> > +       struct kvm_vcpu *vcpu;
> > +
> > +       vcpu = kvm_get_bsp_vcpu(kvm);
> > +       if (!vcpu)
> > +               return -EINVAL;
> > +
> > +       if (!vcpu->arch.hv_clock.tsc_timestamp || !vcpu->arch.hv_clock.system_time)
> > +               return -EIO;
> > +
> > +       if (copy_to_user(argp, &vcpu->arch.hv_clock, sizeof(vcpu->arch.hv_clock)))
> > +               return -EFAULT;
>
> What will happen if the vCPU=0 (e.g., BSP) thread is racing with here to update
> the vcpu->arch.hv_clock?
>
> It is a good idea to making assumption from the VMM (e.g., QEMU) side?

What if the vCPU has never set up the KVM clock and vcpu->arch.hv_clock
isn't even populated?

I don't think we should be using an actual vCPU at all.

I think we have to *create* the pvclock information, just just blindly
memcpy from some vCPU's ->arch.hv_clock.

Yes, we do need to scale via guest TSC frequency, but *only* in the
ka->use_master_clock case because otherwise, everything's hosed anyway.
In the ka->use_master_clock case, where we know all guests are running
at the same TSC frequency, why not just snapshot the scaling factor?
We can fix the broken __get_kvmclock_ns() that way too.

In fact, in the !ka->use_master_clock case, this ioctl should probably
return an error. Because in that case, the KVM clock *isn't* stable in
terms of the guest TSC as it would be in a sane world, and it doesn't
make sense to care about migrating it accurately.

We should think about the case where ka->use_master_clock is true on
the source, but *false* on the destination. Could KVM_SET_CLOCK_GUEST
still work in that mode, by calling compute_guest_tsc()?

> > +
> > +       return 0;
> > +}
> > +
> > +static int kvm_vm_ioctl_set_clock_guest(struct kvm *kvm, void __user *argp)
> > +{
> > +       struct kvm_vcpu *vcpu;
> > +       struct pvclock_vcpu_time_info orig_pvti;
> > +       struct pvclock_vcpu_time_info dummy_pvti;
> > +       int64_t kernel_ns;
> > +       uint64_t host_tsc, guest_tsc;
> > +       uint64_t clock_orig, clock_dummy;
> > +       int64_t correction;
> > +       unsigned long i;
>
> Please ignore me if there is not any chance to make the above (and other places
> in the patchset) to honor reverse xmas tree style.
>
> > +
> > +       vcpu = kvm_get_bsp_vcpu(kvm);
> > +       if (!vcpu)
> > +               return -EINVAL;
> > +
> > +       if (copy_from_user(&orig_pvti, argp, sizeof(orig_pvti)))
> > +               return -EFAULT;
> > +
> > +       /*
> > +        * Sample the kernel time and host TSC at a singular point.
> > +        * We then calculate the guest TSC using this exact point in time,
> > +        * From here we can then determine the delta using the
> > +        * PV time info requested from the user and what we currently have
> > +        * using the fixed point in time. This delta is then used as a
> > +        * correction factor to fixup the potential drift.
> > +        */
> > +       if (!kvm_get_time_and_clockread(&kernel_ns, &host_tsc))
> > +               return -EFAULT;
> > +
> > +       guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> > +
> > +       dummy_pvti = orig_pvti;
> > +       dummy_pvti.tsc_timestamp = guest_tsc;
> > +       dummy_pvti.system_time = kernel_ns + kvm->arch.kvmclock_offset;
> > +
> > +       clock_orig = __pvclock_read_cycles(&orig_pvti, guest_tsc);
> > +       clock_dummy = __pvclock_read_cycles(&dummy_pvti, guest_tsc);
> > +
> > +       correction = clock_orig - clock_dummy;
> > +       kvm->arch.kvmclock_offset += correction;
>
> I am not sure if it is a good idea to rely on userspace VMM to decide the good
> timepoint to issue the ioctl, without assuming any racing.
>
> In addition to live migration, can the user call this API any time during the VM
> is running (to fix the clock drift)? Therefore, any requirement to protect the
> update of kvmclock_offset from racing?
>

$DEITY no. The clock drift at *runtime* due to stray calls of
KVM_REQ_MASTERCLOCK_UPDATE is just a kernel bug. It isn't the user's
responsibility to correct it!

However... where invoked in ka->use_masterclock_mode, perhaps
pvclock_update_vm_gtod_copy() *should* perform the same kind of
calculation, and actually *adjust* ka->kvmclock_offset as the true
definition of the KVM clock drifts from CLOCK_MONOTONIC_RAW?

But internally; never seen by userspace.

Attachment: smime.p7s
Description: S/MIME cryptographic signature

Next message: Jason Wang: "Re: [PATCH v3] vp_vdpa: fix the method of calculating vectors"
Previous message: Xu Yilun: "Re: [PATCH v2] fpga: dfl: fme: revise kernel-doc comments for some functions"
In reply to: Dongli Zhang: "Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup"
Next in thread: Allister, Jack: "Re: [PATCH 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for KVM clock drift fixup"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]