Re: [PATCH 3/3] KVM: x86: Check for injected exceptions before queuing a debug exception
From: Sean Christopherson
Date: Fri Feb 27 2026 - 13:18:51 EST
On Fri, Feb 27, 2026, Yosry Ahmed wrote:
> On Fri, Feb 27, 2026 at 8:34 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Fri, Feb 27, 2026, Sean Christopherson wrote:
> > > So instead of patch 1, I want to try either (a) blocking KVM_SET_VCPU_EVENTS,
> > > KVM_X86_SET_MCE, and KVM_SET_GUEST_DEBUG if nested_run_pending=1, *and* follow-up
> > > with the below WARN-spree, or (b) add a separate flag, e.g. nested_run_in_progress
> > > or so, that is set with nested_run_pending, but cleared on an exit to userspace,
> > > and then WARN on _that_, i.e. so that we can detect KVM bugs (the whole point of
> > > the WARN) and hopefully stop playing this losing game of whack-a-mole with syzkaller.
>
> I like the idea of the WARN there, although something in the back of
> my mind tells me I went through this code before with an exception in
> mind that could be injected with nested_run_pending=1, but I can't
> remember it. Maybe it was injected by userspace and all is good.
If there is such a flow, it's likely a bug, i.e. we'd want the WARN. AFAIK,
every single time the WARN has been hit in the last ~2-3 years has been due to
syzkaller.
> That being said, I hate nested_run_in_progress. It's too close to
> nested_run_pending and I am pretty sure they will be mixed up.
Agreed, though the fact that the name is _too_ close means that, aside from the
potential for disaster (minor detail), it's accurate.
One thought is to hide nested_run_in_progress behind a Kconfig, so that attempts
to use it for anything but the sanity check(s) would fail the build. I don't
really want to create yet another KVM_PROVE_xxx though, but unlike KVM_PROVE_MMU,
I think we want this enabled in production.
I'll chew on this a bit...
> exception_from_userspace's name made me think this is something we
> could key off to WARN, but it's meant to morph queued exceptions from
> userspace into an "exception_vmexit" if needed. The field name is
> generic but its functionality isn't, maybe it should have been called
> exception_check_vmexit or something. Anyway..
No? It's not a "check", it's literally a pending exception that has been morphed
into a VM-Exit.
Hmm, though looking at all that code again, I bet we can dedup a _lot_ of code by
adding kvm_queued_exception.is_vmexit instead of tracking a completely separate
exception. The only potential hiccup I can think of is if there's some wrinkle
in the interaction with already pending/injected exceptions. Pending should be
fine, as the VM-Exit has priority.
Ah, scratch that idea, injected exceptions need to be tracked separately, e.g. see
vmcs12_save_pending_event(). It's correct for vmx_check_nested_events() to
deliver a VM-Exit even if there is an already-injected exception, e.g. if an EPT
Violation in L1's purview triggers when vectoring an injected exception, but in
that case, KVM needs to save the injected exception information into vmc{b,s}12.
> That gave me an idea though, can we add a field to
> kvm_queued_exception to identify the origin of the exception
> (userspace vs. KVM)? Then we can key the warning off of that.
That would incur non-trivial maintenance costs, and it would be tricky to get the
broader protection of the existing WARN "right". E.g. how would KVM know that
the VM-Exit was originally induced by an exception that was queued by userspace?
> We can potentially also avoid adding the field and just plumb the
> argument through to kvm_multiple_exception(), and WARN there if
> nested_run_pending is set and the origin is not userspace?
Not really, because kvm_vcpu_ioctl_x86_set_vcpu_events() doesn't use
kvm_queue_exception(), it stuffs the fields directly.
That said, if you want to try and code it up, I say go for it. Worst case scenario,
you'll have wasted a bit of time.
> > > I think I'm leaning toward (b)? Except for KVM_SET_GUEST_DEBUG, where userspace
> > > is trying to interpose on the guest, restricting ioctls doesn't really add any
> > > value in practice. Yeah, in theory it could _maybe_ prevent userspace from shooting
> > > itself in the foot, but practically speaking, if userspace is restoring state into
> > > a vCPU with nested_run_pending=1, it's either playing on expert mode or is already
> > > completely broken.
> > >
> > > My only hesitation with (b) is that KVM wouldn't be entirely consistent, since
> > > vmx_unhandleable_emulation_required() _does_ explicitly reject a "userspace did
> > > something stupid with nested_run_pending=1" case. So from that perspective, part
> > > of me wants to get greedy and try for (a).
> >
> > On second (fifth?) thought, I don't think (a) is a good idea. In addition to
> > potentially breaking userspace, it also risks preventing genuinely useful sequences.
> > E.g. even if no VMM does so today, it's entirely plausible that a VMM could want
> > to asynchronously inject an #MC to mimic a broadcast, and that the injection could
> > collide with a pending nested VM-Enter.
> >
> > I'll send a separate (maybe RFC?) series for (b) using patch 1 as a starting point.
> > I want to fiddle around with some ideas, and it'll be faster to sketch things out
> > in code versus trying to describe things in text.
>
> So you'll apply patch 3 as-is, drop patch 2, and (potentially) take
> patch 1 and build another series on top of it?
Yeah, that's where I'm trending.