Re: [PATCH 3/3] KVM: x86: conditionally update masterclock data in pvclock_update_vm_gtod_copy()

From: David Woodhouse

Date: Tue May 12 2026 - 01:27:16 EST

On Mon, 2026-05-11 at 17:16 -0700, Dongli Zhang wrote:
>
>
> On 5/9/26 5:22 AM, David Woodhouse wrote:
> > On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> > > The pvclock_update_vm_gtod_copy() function always unconditionally updates
> > > ka->master_kernel_ns and ka->master_cycle_now whenever a
> > > KVM_REQ_MASTERCLOCK_UPDATE occurs. Unfortunately, each masterclock update
> > > increases the risk of kvm-clock drift.
> > >
> > > If pvclock_update_vm_gtod_copy() is not called from
> > > vcpu_enter_guest()-->kvm_update_masterclock(), we keep the existing
> > > workflow. The argument 'forced' is introduced to tell where it is from.
> > >
> > > Otherwise, we avoid updating the masterclock if it is already
> > > active and will remain active. In such cases, updating the masterclock
> > > data is not beneficial and can instead lead to kvm-clock drift.
> > >
> > > As a result, this patch minimizes the chance of unnecessary masterclock
> > > data updates to avoid kvm-clock drift.
> > >
> > > Cc: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > > Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
> >
> > Hmm... so the only caller of pvclock_update_vm_gtod_copy() that doesn't
> > set the 'force' argument is the one in kvm_update_masterclock(), so we
> > are asserting that kvm_update_masterclock() never needs to *change* the
> > masterclock origin point, if it was already set?
> >
> > The gtod notifier callback in pvclock_gtod_update_fn() also ends up
> > setting KVM_REQ_MASTERCLOCK_UPDATE, and is triggered by an actual host
> > timekeeping update (which could potentially be from a clocksource
> > change). It also hypothetically possible that the clocksource changes
> > from TSC → HPET → TSC, switching back to TSC again before the
> > masterclock update ever gets to run. Or maybe a suspend/resume?
> >
> > Are you *sure* that the optimisation is always valid...?
>
> Thank you very much!
>
> I didn't validate the scenario you mentioned. I missed that scenario because I
> assumed that most production systems nowadays use STABLE/CONSTANT/NONSTOP TSC as
> the host clocksource, although I sometimes forgot to add "clocksource=tsc
> tsc=reliable" to my AMD L1 KVM guest (acting as the hypervisor for L2 guest).

I'd love to assume that, but we do still have to cater for systems
without it (or where it goes wrong). And where userspace sets up vCPUs
with different frequencies... although I'd quite like to ban that too
:)

> I didn't follow up on this patch because I noticed another issue. I found that
> the tsc_timestamp in the PVTI can become a very large number if we simply reboot
> the guest VM. This happens because the patch stops updating the masterclock data
> when the masterclock is already active and remains active.
>
> For example:
>
> current guest TSC: 122763682
> PVTI->tsc_timestamp = 18446744073656540006
> PVTI->system_time=196515164269
>
> Although I could not reproduce any bug, that still made me feel uncomfortable.

That's just negative (-53011610). Theoretically it's OK; it's just a
consequence of using a reference point in the past. Probably just
*asking* for guest bugs though, so best avoided.

I'd like to understand how it got like that though. Did your 'reboot'
reset the guest TSC to zero but *not* its kvmclock?

Attachment: smime.p7s
Description: S/MIME cryptographic signature