Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
From: Sean Christopherson
Date: Fri Feb 27 2026 - 11:13:13 EST
On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> On KVM_SET_GUEST_DEBUG, if a #DB or #BP is injected with
> KVM_GUESTDBG_INJECT_DB or KVM_GUESTDBG_INJECT_BP, KVM fails with -EBUSY
> if there is an existing pending exception. This was introduced in
> commit 4f926bf29186 ("KVM: x86: Polish exception injection via
> KVM_SET_GUEST_DEBUG") to avoid a warning in kvm_queue_exception(),
> presumably to avoid overriding a pending exception.
>
> This added another (arguably nice) property: if there's a pending
> exception, KVM_SET_GUEST_DEBUG cannot cause a #DF or triple fault.
> However, if an exception is injected, KVM_SET_GUEST_DEBUG will cause
> a #DF or triple fault in the guest, as kvm_multiple_exception() combines
> them.
First off, this patch looks good irrespective of nested crud. Disallowing injection
of #DB/#BP while there's already an injected exception aligns with architectural
behavior; KVM needs to finish delivering the exception and thus "complete" the
instruction before queueing a new exception.
As for nested, I _was_ going to say that, assuming the original motivation for
this patch is to avoid triggering a nested VM-Exit while nested_run_pending=1,
this is incomplete. E.g. if the #BP or #DB itself is being intercepted by L1,
then queueing the exception will incorrectly deliver the nested VM-Exit before
nested VM-Enter completes. I.e. I _thought_ we would also need:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..fdbd272027ed 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12526,6 +12526,9 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
if (vcpu->arch.guest_state_protected)
return -EINVAL;
+ if (vcpu->arch.nested_run_pending)
+ return -EBUSY;
+
vcpu_load(vcpu);
if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
But that isn't actually the case, because {svm,vmx}_check_nested_events() blocks
exceptions and VM-Exits while nested_run_pending=1. Off-list, I had rejected
modifying nested_vmx_triple_fault() to drop the triple fault like so:
@@ -5191,6 +5191,9 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
{
+ if (to_vmx(vcpu)->nested.nested_run_pending)
+ return;
+
kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
}
because "That would largely defeat the entire purpose of the WARN. The whole point
is to detect bugs where KVM synthesizes a VM-Exit before completing nested VM-Enter.
It should be impossible for nested_{vmx,svm}_triple_fault() to be called with
nested_run_pending=true."
Given the above, that argument doesn't hold up at first glance, because blocking
the triple fault if nested_run_pending=1 would be consistent with how other "synchronous"
events are handled by nVMX and nSVM. But after staring at this a bit, unless I'm
forgetting something (entirely possible), I actually think {svm,vmx}_check_nested_events()
are wrong, because I stand behind my quoted statement: it should be impossible for
KVM to synthesize a synchronous event before completing nested VM-Enter.
Blocking pending and injected exceptions was added by bfcf83b1444d ("KVM: nVMX:
Fix trying to cancel vmlauch/vmresume"), and unfortunately neither the changelog
nor Lore[1] provides any details as to what exactly prompted the fix. I _suspect_
it was either syzkaller induced, or the manifestation of other bugs in KVM's
exception handling.
Finally getting to the point, in theory, I _think_ KVM should actually WARN on
all cases where it temporarily blocks a synchronous event due to a pending nested
VM-Enter (see below diff for the basic gist). However, that would open the
floodgates to syzkaller, because KVM_SET_VCPU_EVENTS can obviously stuff events,
KVM_X86_SET_MCE can queue a #MC, etc.
I _think_ we might be able to get away with rejecting KVM_SET_VCPU_EVENTS if
nested_run_pending=1, without breaking userspace? Google's VMM is insane and does
KVM_SET_VCPU_EVENTS before KVM_SET_NESTED_STATE, but in real usage, i.e. outside
of selftests, (I hope) no VMM will restore into a "live" vCPU.
So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* following
up with the below WARN-spree, or (b) adding a separate flag, e.g.
nested_run_in_progress or so, that is set along with nested_run_pending but
cleared on an exit to userspace, and then WARNing on _that_, so that we can
detect KVM bugs (the whole point of the WARN) and hopefully stop playing this
losing game of whack-a-mole with syzkaller.
I think I'm leaning toward (b)? Except for KVM_SET_GUEST_DEBUG, where userspace
is trying to interpose on the guest, restricting ioctls doesn't really add any
value in practice. Yeah, in theory it could _maybe_ prevent userspace from shooting
itself in the foot, but practically speaking, if userspace is restoring state into
a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
completely broken.
My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
something stupid with nested_run_pending=1" case. So from that perspective, part
of me wants to get greedy and try for (a).
[1] https://lore.kernel.org/all/?q=%22KVM:%20nVMX:%20Fix%20trying%20to%20cancel%20vmlauch%22
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index de90b104a0dd..d624e4db704a 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1605,14 +1605,14 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
}
if (vcpu->arch.exception_vmexit.pending) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
nested_svm_inject_exception_vmexit(vcpu);
return 0;
}
if (vcpu->arch.exception.pending) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
return 0;
}
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 248635da6766..a223c5e86188 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4336,7 +4336,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
*/
if (vcpu->arch.exception_vmexit.pending &&
!vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
nested_vmx_inject_exception_vmexit(vcpu);
@@ -4345,13 +4345,13 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
if (vcpu->arch.exception.pending &&
!vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
goto no_vmexit;
}
if (vmx->nested.mtf_pending) {
- if (block_nested_events)
+ if (WARN_ON_ONCE(block_nested_events))
return -EBUSY;
nested_vmx_update_pending_dbg(vcpu);
nested_vmx_vmexit(vcpu, EXIT_REASON_MONITOR_TRAP_FLAG, 0, 0);
@@ -4359,7 +4359,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
}
if (vcpu->arch.exception_vmexit.pending) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
nested_vmx_inject_exception_vmexit(vcpu);
@@ -4367,7 +4367,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
}
if (vcpu->arch.exception.pending) {
- if (block_nested_exceptions)
+ if (WARN_ON_ONCE(block_nested_exceptions))
return -EBUSY;
goto no_vmexit;
}