Re: [PATCH 2/3] KVM: x86: conditionally clear KVM_REQ_MASTERCLOCK_UPDATE at the end of KVM_SET_CLOCK

From: David Woodhouse

Date: Sat May 09 2026 - 16:04:39 EST

On Thu, 2026-01-15 at 12:22 -0800, Dongli Zhang wrote:
> The KVM_SET_CLOCK command calls pvclock_update_vm_gtod_copy() to update the
> masterclock data.
>
> Many vCPUs may already have KVM_REQ_MASTERCLOCK_UPDATE pending before
> KVM_SET_CLOCK is invoked. If pvclock_update_vm_gtod_copy() decides to use
> the masterclock, there is no need to update the masterclock multiple times
> afterward. As noted in commit c52ffadc65e2 ("KVM: x86: Don't unnecessarily
> force masterclock update on vCPU hotplug"), each unnecessary
> KVM_REQ_MASTERCLOCK_UPDATE can cause the kvm-clock time to jump.
>
> Therefore, clear KVM_REQ_MASTERCLOCK_UPDATE for each vCPU at the end of
> KVM_SET_CLOCK when the master clock is active. The 'tsc_write_lock' ensures
> that only requests issued before KVM_SET_CLOCK are cleared.

Hm, I think we should do this in kvm_make_mclock_inprogress_request()
instead.

Currently, things like pvclock_gtod_update_fn() and one or two others
set KVM_REQ_MASTERCLOCK_UPDATE for all vCPUs. Not just one, because we
don't want *any* vCPU going back into guest mode before doing the work.

So they could *all* end up calling into kvm_masterclock_update() at the
same time. I have a vague recollection there was once a mutex there,
but now there isn't.

So all vCPUs can each, in parallel, set KVM_REQ_MCLOCK_INPROGRESS on
all other vCPUs. That's an unhandled request which just makes a vCPU
spin (preventing it from entering the guest until the clock update
completes).

Then each vCPU tries to take the kvm->arch.tsc_write_lock in
kvm_start_pvclock_update(), and from this point they'll be
serialized... but every vCPU will go through the whole of
pvclock_update_vm_gtod_copy() for itself, updating the masterclock one
more time for each vCPU in the system?

I came here to say that kvm_make_mclock_inprogress_request() should
probably clear KVM_REQ_MASTERCLOCK_UPDATE on all other vCPUs, since
it's about to do the work, and the other vCPUs will no longer need to.

But... it's more broken than that now.

I think maybe vcpu_enter_guest() should call kvm_update_masterclock()
if *kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE)* (i.e. not *clear* the
bit by using kvm_check_request()).

And then after getting ->arch.tsc_write_lock,
kvm_start_pvclock_update() should check if KVM_REQ_MASTERCLOCK_UPDATE
is *still* set, and only proceed with the update if it is?

I'll play with it some more... something like this perhaps?

From 91e37d74ee267c722bc2c8f26754fcdfbc040e29 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@xxxxxxxxxxxx>
Date: Sat, 9 May 2026 16:54:01 +0100
Subject: [PATCH] KVM: x86: Avoid redundant masterclock updates from multiple
vCPUs

When a masterclock update is triggered (e.g. by the clocksource change
notifier), KVM_REQ_MASTERCLOCK_UPDATE is set on all vCPUs. Without this
fix, each vCPU independently processes the request and redundantly
re-executes the entire pvclock_update_vm_gtod_copy() sequence, serialized
only by tsc_write_lock. Each redundant re-snapshot of the master clock
reference point introduces potential clock drift.

Fix this by having __kvm_start_pvclock_update() check, after acquiring
the lock, whether the requesting vCPU's KVM_REQ_MASTERCLOCK_UPDATE is
still set. If another vCPU already did the update and cleared it, bail
out. Otherwise, clear the request on all other vCPUs before proceeding.

The caller in vcpu_enter_guest() now uses kvm_test_request() (non-clearing)
since the clearing is done inside __kvm_start_pvclock_update() under the
lock.

Suggested-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
---
arch/x86/kvm/x86.c | 56 ++++++++++++++++++++++++++++++++++++----------
1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed0f7dec08bd..b249c4a6048c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3259,10 +3259,39 @@ static void kvm_make_mclock_inprogress_request(struct kvm *kvm)
kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
}

-static void __kvm_start_pvclock_update(struct kvm *kvm)
+static void kvm_clear_mclock_inprogress_request(struct kvm *kvm)
{
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+}
+
+static bool __kvm_start_pvclock_update(struct kvm *kvm, struct kvm_vcpu *requesting_vcpu)
+{
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+
raw_spin_lock_irq(&kvm->arch.tsc_write_lock);
+
+ /*
+ * If another vCPU already did the update while we were waiting
+ * for the lock, our request will have been cleared. Bail out.
+ */
+ if (requesting_vcpu &&
+ !kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE, requesting_vcpu)) {
+ kvm_clear_mclock_inprogress_request(kvm);
+ raw_spin_unlock_irq(&kvm->arch.tsc_write_lock);
+ return false;
+ }
+
+ /* The update is VM-wide; prevent other vCPUs from redoing it. */
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_clear_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+
write_seqcount_begin(&kvm->arch.pvclock_sc);
+ return true;
}

static void kvm_start_pvclock_update(struct kvm *kvm)
@@ -3270,7 +3299,7 @@ static void kvm_start_pvclock_update(struct kvm *kvm)
kvm_make_mclock_inprogress_request(kvm);

/* no guest entries from this point */
- __kvm_start_pvclock_update(kvm);
+ __kvm_start_pvclock_update(kvm, NULL);
}

static void kvm_end_pvclock_update(struct kvm *kvm)
@@ -3279,22 +3308,25 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
struct kvm_vcpu *vcpu;
unsigned long i;

- write_seqcount_end(&ka->pvclock_sc);
- raw_spin_unlock_irq(&ka->tsc_write_lock);
kvm_for_each_vcpu(i, vcpu, kvm)
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

/* guest entries allowed */
- kvm_for_each_vcpu(i, vcpu, kvm)
- kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+ kvm_clear_mclock_inprogress_request(kvm);
+
+ write_seqcount_end(&ka->pvclock_sc);
+ raw_spin_unlock_irq(&ka->tsc_write_lock);
}

-static void kvm_update_masterclock(struct kvm *kvm)
+static void kvm_update_masterclock(struct kvm *kvm, struct kvm_vcpu *vcpu)
{
kvm_hv_request_tsc_page_update(kvm);
- kvm_start_pvclock_update(kvm);
- pvclock_update_vm_gtod_copy(kvm);
- kvm_end_pvclock_update(kvm);
+ kvm_make_mclock_inprogress_request(kvm);
+
+ if (__kvm_start_pvclock_update(kvm, vcpu)) {
+ pvclock_update_vm_gtod_copy(kvm);
+ kvm_end_pvclock_update(kvm);
+ }
}

/*
@@ -11481,8 +11513,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_mmu_free_obsolete_roots(vcpu);
if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
__kvm_migrate_timers(vcpu);
- if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
- kvm_update_masterclock(vcpu->kvm);
+ if (kvm_test_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
+ kvm_update_masterclock(vcpu->kvm, vcpu);
if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
kvm_gen_kvmclock_update(vcpu);
if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
--
2.43.0

Attachment: smime.p7s
Description: S/MIME cryptographic signature