Re: [PATCH v2 20/20] KVM: x86: Use gfn_to_pfn_cache for record_steal_time

From: Sean Christopherson

Date: Mon Jun 08 2026 - 20:45:51 EST

On Tue, Jun 02, 2026, David Woodhouse wrote:
> On Sat, 30 May 2026 06:19:32 +0000, sashiko-bot@xxxxxxxxxx wrote:
> > [Severity: High]
> > Does this introduce a scheduling while atomic bug on non-PREEMPT_RT kernels?
> >
> > The CLASS(gpc_map_local, st_map) macro acquires a read_lock on gpc->lock,
> > which disables preemption. While this lock is held, if the guest supports
> > PV TLB flush, the code calls kvm_vcpu_flush_tlb_guest().
> >
> > If TDP is disabled (shadow paging), kvm_vcpu_flush_tlb_guest() calls
> > kvm_mmu_sync_roots() and eventually mmu_sync_children(). This path can yield
> > via cond_resched_rwlock_write(). Yielding while preemption is disabled by
> > the gpc read lock will trigger a BUG.
>
> Ah, that issue exists in the previous versions too, but it's simple
> enough to fix. There's no particular timing constraint for flushing the
> TLB; it just have to be done before this vCPU ever runs again. It can
> just be moved to the end of the function after the lock is dropped.
>
> That does mean record_steal_time() should use the explicit
> gpc_map_local_lock()/gpc_map_local_unlock() instead of the CLASS()
> macro, but that's easy enough.

Actually, we use KVM_REQ_TLB_FLUSH_GUEST and "optimize" the code for the rare
case where KVM already have a TLB flushed queued for the vCPU. E.g. over two
patches (so that changing the order of the request processing is isolated):

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b27dd9ba0aa..48234eeb246b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3764,7 +3764,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
st_preempted & KVM_VCPU_FLUSH_TLB);
if (st_preempted & KVM_VCPU_FLUSH_TLB)
- kvm_vcpu_flush_tlb_guest(vcpu);
+ kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
} else {
WRITE_ONCE(st->preempted, 0);
vcpu->arch.st.preempted = 0;
@@ -11165,6 +11165,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (unlikely(r))
goto out;
}
+ if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
+ record_steal_time(vcpu);
if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
kvm_mmu_sync_roots(vcpu);
if (kvm_check_request(KVM_REQ_LOAD_MMU_PGD, vcpu))
@@ -11214,8 +11216,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
r = 1;
goto out;
}
- if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
- record_steal_time(vcpu);
if (kvm_check_request(KVM_REQ_PMU, vcpu))
kvm_pmu_handle_event(vcpu);
if (kvm_check_request(KVM_REQ_PMI, vcpu))

KVM needs to ensure the RMW on st->preempted is atomic, to avoid re-introducing
the bug fixed by commit b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag
is not missed"), but AFAICT there's nothing that requires to complete the TLB
flush before bumping the version, KVM just needs to service the flush before
entering the guest on that vCPU.