[PATCH 2/2] KVM: nVMX: Don't use vmcs01.GUEST_CR3 to snapshot L1's CR3 when EPT is disabled
From: Sean Christopherson
Date: Wed Jun 03 2026 - 18:34:53 EST

Add a dedicated field in "struct nested_vmx" to track L1's pre-VM-Enter CR3
instead of using vmcs01.GUEST_CR3, which isn't anywhere near as safe as the
comment purports it to be. E.g. in addition to the warn_on_missed_cc bug
(that was fixed by relocating the consistency check), if getting vmcs12
pages (during actual nested VM-Entry) fails and EPT is disabled (in KVM),
KVM will return control to userspace with vmcs01.GUEST_CR3 holding a
guest-controlled value.

Alternatively, KVM could force a reload of vmcs01.GUEST_CR3 by resetting
the MMU context in the error path, but as above, the safety of the vmcs01
approach is extremely questionable, e.g. it took all of ~4 months for the
code to break.

Fixes: 671ddc700fd0 ("KVM: nVMX: Don't leak L1 MMIO regions to L2")
Cc: stable@xxxxxxxxxxxxxxx
Cc: Jim Mattson <jmattson@xxxxxxxxxx>
Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
---
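Illustrative note, not part of the patch: the difference between the two
snapshot schemes can be sketched with a toy userspace model. Everything
below (struct toy_vcpu, the toy_* helpers, and any CR3 values fed to them)
is hypothetical and only mirrors the names used in this patch. It is not
KVM code and makes no claim about the real MMU reload paths.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for KVM state; none of these are the real kernel types. */
struct toy_vcpu {
	bool enable_ept;           /* models KVM's enable_ept knob */
	uint64_t arch_cr3;         /* models vcpu->arch.cr3 (L1's "real" CR3) */
	uint64_t vmcs01_guest_cr3; /* models vmcs01.GUEST_CR3 (shadow CR3 if !EPT) */
	uint64_t pre_vmenter_cr3;  /* models vmx->nested.pre_vmenter_cr3 */
};

/* Old scheme: overwrite the modeled vmcs01.GUEST_CR3 when EPT is disabled. */
static void toy_snapshot_old(struct toy_vcpu *v)
{
	if (!v->enable_ept)
		v->vmcs01_guest_cr3 = v->arch_cr3;
}

/* New scheme: stash L1's CR3 in a dedicated software-only field. */
static void toy_snapshot_new(struct toy_vcpu *v)
{
	v->pre_vmenter_cr3 = v->arch_cr3;
}

/* Unwind after a "late" VM-Fail under the new scheme. */
static void toy_restore_new(struct toy_vcpu *v)
{
	v->arch_cr3 = v->pre_vmenter_cr3;
}
```

In this model, with EPT disabled, toy_snapshot_old() leaves L1's
(guest-controlled) CR3 in the modeled vmcs01.GUEST_CR3 until something
reloads the shadow root. toy_snapshot_new() never touches that field, and
toy_restore_new() still recovers L1's CR3 after arch_cr3 has been
clobbered.
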
 arch/x86/kvm/vmx/nested.c | 21 ++++++++-------------
 arch/x86/kvm/vmx/vmx.h    |  7 +++++++
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 039e234e7d2b..772b8090d06a 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3668,19 +3668,14 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
&vmx->nested.pre_vmenter_ssp_tbl);
 	/*
-	 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled. In the
-	 * event of a "late" VM-Fail, i.e. a VM-Fail detected by hardware but
-	 * not KVM, KVM must unwind its software model to the pre-VM-Entry host
-	 * state. When EPT is disabled, GUEST_CR3 holds KVM's shadow CR3, not
-	 * L1's "real" CR3, which causes nested_vmx_restore_host_state() to
-	 * corrupt vcpu->arch.cr3. Stuffing vmcs01.GUEST_CR3 results in the
-	 * unwind naturally setting arch.cr3 to the correct value. Smashing
-	 * vmcs01.GUEST_CR3 is safe because nested VM-Exits, and the unwind,
-	 * reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is guaranteed to be
-	 * overwritten with a shadow CR3 prior to re-entering L1.
+	 * Stash L1's CR3, so that in the event of a "late" VM-Fail, i.e. a
+	 * VM-Fail detected by hardware but not KVM, KVM can unwind its
+	 * software model to the pre-VM-Entry host state. When EPT is
+	 * disabled, GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3,
+	 * and so simply restoring from vmcs01.GUEST_CR3 would corrupt
+	 * vcpu->arch.cr3.
 	 */
-	if (!enable_ept)
-		vmcs_writel(GUEST_CR3, vcpu->arch.cr3);
+	vmx->nested.pre_vmenter_cr3 = vcpu->arch.cr3;
 	vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02);
@@ -4992,7 +4987,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
 	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
 	nested_ept_uninit_mmu_context(vcpu);
-	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+	vcpu->arch.cr3 = vmx->nested.pre_vmenter_cr3;
 	kvm_register_mark_available(vcpu, VCPU_REG_CR3);
 	/*
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index de9de0d2016c..dc8517f15bc4 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -159,6 +159,13 @@ struct nested_vmx {
 	bool has_preemption_timer_deadline;
 	bool preemption_timer_expired;
+	/*
+	 * Used to restore L1's CR3 if hardware detects a VM-Fail Consistency
+	 * Check that KVM does not, in which case KVM needs to unwind CR3 back
+	 * to its pre-VM-Enter state, NOT to vmcs01.HOST_CR3.
+	 */
+	unsigned long pre_vmenter_cr3;
+
 	/*
 	 * Used to snapshot MSRs that are conditionally loaded on VM-Enter in
 	 * order to propagate the guest's pre-VM-Enter value into vmcs02. For
--
2.54.0.1032.g2f8565e1d1-goog