Re: [PATCH v4 26/30] KVM: x86: Don't treat interrupts as allowed just because a nested run is pending

From: Sean Christopherson

Date: Tue Jun 16 2026 - 13:46:38 EST

On Mon, Jun 15, 2026, Yosry Ahmed wrote:
> > > The code makes sense to me but I am trying to make sense of the changelog.
> >
> > What part (parts?) is confusing? Honest question. I'm trying to reword the
> > changelog to make it "better", but I'm failing miserably because I don't know
> > what's wrong :-)
>
> 1. For kvm_vcpu_has_events() being unaffected, the explanation in
> paragraph #3 is focused on the code path from nested_vmx_run() ->
> kvm_emulate_halt_noskip(). I don't immediately see how
> kvm_arch_vcpu_runnable() is unaffected.

To reach kvm_vcpu_has_events(), kvm_vcpu_running() needs to return false. For
that to happen, vcpu->arch.mp_state needs to be something other than RUNNABLE.

If nested_run_pending is true, then mp_state *must* be RUNNABLE (barring bugs or
stupid userspace), because KVM shouldn't emulate VMRUN/VMLAUNCH/VMRESUME while
the vCPU is !RUNNABLE.

I didn't include that in the changelog because I thought it was obvious, but
obviously (LOL) not :-D

I called out the GUEST_ACTIVITY_HLT case because (to me) that is less obvious.
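
To make the reachability argument concrete, here is a rough userspace sketch
(not KVM code). All sketch_* names are made up for illustration, and the
model deliberately ignores everything except mp_state and the pending-run
flag (e.g. it ignores async #PF halting). It only shows that "pending nested
run implies RUNNABLE" survives the HLT activity-state case, provided the flag
is cleared before the vCPU is halted:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for KVM's mp_state values. */
enum sketch_mp_state {
	SKETCH_RUNNABLE,
	SKETCH_HALTED,
	SKETCH_INIT_RECEIVED,
};

/* Hypothetical, heavily reduced vCPU state. */
struct sketch_vcpu {
	enum sketch_mp_state mp_state;
	bool nested_run_pending;
};

/* Simplified model of "is the vCPU running?": only mp_state matters here. */
static bool sketch_vcpu_running(const struct sketch_vcpu *v)
{
	return v->mp_state == SKETCH_RUNNABLE;
}

/* The has_events() query is only reached when the vCPU is NOT running. */
static bool sketch_reaches_has_events(const struct sketch_vcpu *v)
{
	return !sketch_vcpu_running(v);
}

/*
 * Model of entering L2 in the HLT activity state: the pending-run flag is
 * cleared *before* the vCPU leaves RUNNABLE, so the two never overlap.
 */
static void sketch_enter_l2_halted(struct sketch_vcpu *v)
{
	v->nested_run_pending = false;
	v->mp_state = SKETCH_HALTED;
}

/* The invariant the argument relies on: pending nested run => RUNNABLE. */
static bool sketch_invariant_holds(const struct sketch_vcpu *v)
{
	return !v->nested_run_pending || v->mp_state == SKETCH_RUNNABLE;
}
```

In this model, any state where sketch_reaches_has_events() returns true has
nested_run_pending == false, as long as the invariant holds.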

> 2. More importantly, paragraphs #3 and #4 read like this patch would
> regress kvm_vcpu_ready_for_interrupt_injection() and
> kvm_vcpu_has_events() if it affected them. Maybe clearly state that
> this patch is the right thing to do for these 2 functions as well, but
> they are more-or-less unaffected by the bug anyway? For
> kvm_vcpu_ready_for_interrupt_injection(), maybe just make it more
> clear in paragraph #4 that it currently incorrectly treats interrupts
> as allowed in the problematic scenario, but it is not a problem
> because ..., and it only results in a spurious exit to userspace (or
> not even that?).

Is this better?

When querying whether or not interrupts (IRQs) are allowed, check for a
pending nested run _after_ checking whether or not interrupts are blocked.
If L1 is running L2 _without_ nested_exit_on_intr(), i.e. if L1 IRQs can
be blocked while running L2, and interrupts will indeed be blocked once the
nested VM-Enter to L2 is completed, then KVM should treat interrupts as not
being allowed.

For injection, this avoids an unnecessary (forced) VM-Exit, as KVM can
immediately request an IRQ window, instead of forcing an exit and _then_
requesting an IRQ window (because after the forced exit, KVM will see that
interrupts are blocked).

For non-injection usage, only kvm_vcpu_ready_for_interrupt_injection() is
affected in practice. Barring KVM bugs or misbehaving userspace (at which
point all architectural guarantees are off), kvm_vcpu_has_events() is
unreachable when a nested run is pending. To reach kvm_vcpu_has_events(),
kvm_vcpu_running() needs to return false, i.e. vcpu->arch.mp_state needs
to be something other than RUNNABLE. If nested_run_pending is true, then
mp_state *must* be RUNNABLE (again barring bugs or stupid userspace),
because KVM shouldn't emulate VMRUN/VMLAUNCH/VMRESUME while the vCPU is
!RUNNABLE.

The one "near miss" is VMX's GUEST_ACTIVITY_STATE field, which allows L1 to
put the vCPU into HLT or WFS as part of nested VMLAUNCH/VMRESUME. However,
KVM clears nested_run_pending prior to calling kvm_emulate_halt_noskip()
when putting L2 into HLT via GUEST_ACTIVITY_HLT, and also clears the flag
before setting mp_state to INIT_RECEIVED. SVM has no equivalent to
GUEST_ACTIVITY_STATE.

I.e. the vCPU will always be runnable if a nested run is pending, and thus
kvm_arch_vcpu_runnable() => kvm_vcpu_has_events() is effectively dead code,
as is __kvm_emulate_halt() => kvm_vcpu_has_events(). TDX is a non-issue as
it doesn't support nested VMX. Similarly, kvm_can_do_async_pf() is
unreachable as KVM shouldn't be faulting in memory with a pending nested
VM-Enter.

As for kvm_vcpu_ready_for_interrupt_injection(), KVM's current behavior of
incorrectly treating interrupts as being allowed could result in KVM
prematurely exiting to userspace to accept an ExtINT. But, KVM will still
hold/block the ExtINT and request its own IRQ window. I.e. the net effect
is more or less the same as the for-injection case; the unnecessary exit
just happens at a different boundary.
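
For reference, the ordering change described in the first paragraph can be
sketched in plain userspace C. This is a toy model, not the actual patch: the
struct, its fields, and the toy_* functions are hypothetical, and it assumes
the usual convention that 0 means "not allowed", a positive value means
"allowed", and -EBUSY means "can't tell yet, force an exit" (which a caller
that only tests for non-zero would read as "allowed"):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced vCPU state. */
struct toy_vcpu {
	bool nested_run_pending; /* nested VM-Enter emulated, not yet completed */
	bool irqs_blocked;       /* IRQs are blocked once the VM-Enter completes */
};

/* Old ordering: a pending nested run wins, even if IRQs are blocked. */
static int toy_interrupt_allowed_old(const struct toy_vcpu *v)
{
	if (v->nested_run_pending)
		return -EBUSY;

	return !v->irqs_blocked;
}

/* New ordering: check for blocked IRQs first, then for a pending run. */
static int toy_interrupt_allowed_new(const struct toy_vcpu *v)
{
	if (v->irqs_blocked)
		return 0;

	if (v->nested_run_pending)
		return -EBUSY;

	return 1;
}
```

The two variants differ only when a nested run is pending *and* IRQs are
blocked: the old ordering returns -EBUSY, the new ordering returns 0, so the
caller can go straight to requesting an IRQ window.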