Re: [PATCH] KVM: x86: Defer non-architectural delivery of exception payload to userspace read

From: Yosry Ahmed

Date: Tue Feb 17 2026 - 20:06:37 EST


On Tue, Feb 17, 2026 at 04:54:38PM -0800, Sean Christopherson wrote:
> When attempting to play nice with userspace that hasn't enabled
> KVM_CAP_EXCEPTION_PAYLOAD, defer KVM's non-architectural delivery of the
> payload until userspace actually reads relevant vCPU state, and more
> importantly, force delivery of the payload in *all* paths where userspace
> saves relevant vCPU state, not just KVM_GET_VCPU_EVENTS.
>
> Ignoring userspace save/restore for the moment, delivering the payload
> before the exception is injected is wrong regardless of whether L1 or L2
> is running. To make matters even more confusing, the flaw *currently*
> being papered over by the !is_guest_mode() check isn't even the same bug
> that commit da998b46d244 ("kvm: x86: Defer setting of CR2 until #PF
> delivery") was trying to avoid.
>
> At the time of commit da998b46d244, KVM didn't correctly handle exception
> intercepts, as KVM would wait until VM-Entry into L2 was imminent to check
> if the queued exception should morph to a nested VM-Exit. I.e. KVM would
> deliver the payload to L2 and then synthesize a VM-Exit into L1. But the
> payload was only the most blatant issue, e.g. waiting to check exception
> intercepts would also lead to KVM incorrectly escalating a
> should-be-intercepted #PF into a #DF.
>
> That underlying bug was eventually fixed by commit 7709aba8f716 ("KVM: x86:
> Morph pending exceptions to pending VM-Exits at queue time"), but in the
> interim, commit a06230b62b89 ("KVM: x86: Deliver exception payload on
> KVM_GET_VCPU_EVENTS") came along and subtly added another dependency on
> the !is_guest_mode() check.
>
> While not recorded in the changelog, the motivation for deferring the
> !exception_payload_enabled delivery was to fix a flaw where a synthesized
> MTF (Monitor Trap Flag) VM-Exit would drop a pending #DB and clobber DR6.
> On a VM-Exit, VMX CPUs save pending #DB information into the VMCS, which
> is emulated by KVM in nested_vmx_update_pending_dbg() by grabbing the
> payload from the queue/pending exception. I.e. prematurely delivering the
> payload would cause the pending #DB to not be recorded in the VMCS, and of
> course, clobber L2's DR6 as seen by L1.
>
> Jumping back to save+restore, the quirked behavior of forcing delivery of
> the payload only works if userspace does KVM_GET_VCPU_EVENTS *before*
> CR2 or DR6 is saved, i.e. before KVM_GET_SREGS{,2} and KVM_GET_DEBUGREGS.
> E.g. if userspace does KVM_GET_SREGS before KVM_GET_VCPU_EVENTS, then the
> CR2 saved by userspace won't contain the payload for the exception saved
> by KVM_GET_VCPU_EVENTS.
>
> Deliberately deliver the payload in the store_regs() path, as it's the
> least awful option even though userspace may not be doing save+restore:
> if userspace _is_ doing save+restore, it could elide KVM_GET_SREGS,
> knowing that SREGS were already saved when the vCPU exited.
>
> Link: https://lore.kernel.org/all/20200207103608.110305-1-oupton@xxxxxxxxxx
> Cc: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>

Seems like this is the same change as the one in
https://lore.kernel.org/kvm/aYI4d0zPw3K5BedW@xxxxxxxxxx/, in which case:

Reviewed-by: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>
Tested-by: Yosry Ahmed <yosry.ahmed@xxxxxxxxx>

> ---
> arch/x86/kvm/x86.c | 62 +++++++++++++++++++++++++++++-----------------
> 1 file changed, 39 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index db3f393192d9..365ce3ea4a32 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -864,9 +864,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu, unsigned int nr,
> vcpu->arch.exception.error_code = error_code;
> vcpu->arch.exception.has_payload = has_payload;
> vcpu->arch.exception.payload = payload;
> - if (!is_guest_mode(vcpu))
> - kvm_deliver_exception_payload(vcpu,
> - &vcpu->arch.exception);
> return;
> }
>
> @@ -5532,18 +5529,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
> return 0;
> }
>
> -static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> - struct kvm_vcpu_events *events)
> +static struct kvm_queued_exception *kvm_get_exception_to_save(struct kvm_vcpu *vcpu)
> {
> - struct kvm_queued_exception *ex;
> -
> - process_nmi(vcpu);
> -
> -#ifdef CONFIG_KVM_SMM
> - if (kvm_check_request(KVM_REQ_SMI, vcpu))
> - process_smi(vcpu);
> -#endif
> -
> /*
> * KVM's ABI only allows for one exception to be migrated. Luckily,
> * the only time there can be two queued exceptions is if there's a
> @@ -5554,21 +5541,46 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> if (vcpu->arch.exception_vmexit.pending &&
> !vcpu->arch.exception.pending &&
> !vcpu->arch.exception.injected)
> - ex = &vcpu->arch.exception_vmexit;
> - else
> - ex = &vcpu->arch.exception;
> + return &vcpu->arch.exception_vmexit;
> +
> + return &vcpu->arch.exception;
> +}
> +
> +static void kvm_handle_exception_payload_quirk(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
>
> /*
> - * In guest mode, payload delivery should be deferred if the exception
> - * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
> - * intercepts #PF, ditto for DR6 and #DBs. If the per-VM capability,
> - * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> - * propagate the payload and so it cannot be safely deferred. Deliver
> - * the payload if the capability hasn't been requested.
> + * If KVM_CAP_EXCEPTION_PAYLOAD is disabled, then (prematurely) deliver
> + * the pending exception payload when userspace saves *any* vCPU state
> + * that interacts with exception payloads to avoid breaking userspace.
> + *
> + * Architecturally, KVM must not deliver an exception payload until the
> + * exception is actually injected, e.g. to avoid losing pending #DB
> + * information (which VMX tracks in the VMCS), and to avoid clobbering
> + * state if the exception is never injected for whatever reason. But
> + * if KVM_CAP_EXCEPTION_PAYLOAD isn't enabled, then userspace may or
> + * may not propagate the payload across save+restore, and so KVM can't
> + * safely defer delivery of the payload.
> */
> if (!vcpu->kvm->arch.exception_payload_enabled &&
> ex->pending && ex->has_payload)
> kvm_deliver_exception_payload(vcpu, ex);
> +}
> +
> +static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> + struct kvm_vcpu_events *events)
> +{
> + struct kvm_queued_exception *ex = kvm_get_exception_to_save(vcpu);
> +
> + process_nmi(vcpu);
> +
> +#ifdef CONFIG_KVM_SMM
> + if (kvm_check_request(KVM_REQ_SMI, vcpu))
> + process_smi(vcpu);
> +#endif
> +
> + kvm_handle_exception_payload_quirk(vcpu);
>
> memset(events, 0, sizeof(*events));
>
> @@ -5747,6 +5759,8 @@ static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
> vcpu->arch.guest_state_protected)
> return -EINVAL;
>
> + kvm_handle_exception_payload_quirk(vcpu);
> +
> memset(dbgregs, 0, sizeof(*dbgregs));
>
> BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db));
> @@ -12137,6 +12151,8 @@ static void __get_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
> if (vcpu->arch.guest_state_protected)
> goto skip_protected_regs;
>
> + kvm_handle_exception_payload_quirk(vcpu);
> +
> kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
> kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
> kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);
>
> base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
> --
> 2.53.0.335.g19a08e0c02-goog
>