Re: [PATCH 3/5] KVM: nSVM: Don't forget about L1-injected events
From: Maxim Levitsky
Date: Mon Apr 04 2022 - 05:53:44 EST
On Thu, 2022-03-10 at 22:38 +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@xxxxxxxxxx>
>
> In SVM synthetic software interrupts or INT3 or INTO exception that L1
> wants to inject into its L2 guest are forgotten if there is an intervening
> L0 VMEXIT during their delivery.
>
> They are re-injected correctly with VMX, however.
>
> This is because there is an assumption in SVM that such exceptions will be
> re-delivered by simply re-executing the current instruction.
> Which might not be true if this is a synthetic exception injected by L1,
> since in this case the re-executed instruction will be one already in L2,
> not the VMRUN instruction in L1 that attempted the injection.
>
> Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err} until
> it is either re-injected successfully or returned to L1 upon a nested
> VMEXIT.
> Make sure to always re-queue such event if returned in EXITINTINFO.
>
> The handling of L0 -> {L1, L2} event re-injection is left as-is to avoid
> unforeseen regressions.
Some time ago I noticed this too, but I haven't dug into it too much.
I remember I even had a half-baked patch for this that I never posted,
because I didn't think about the possibility of this synthetic injection.
Just to be clear that I understand this correctly:
1. What is happening is that L1 is injecting INTn/INTO/INT3, but the L2 code
doesn't actually contain an INTn/INTO/INT3 instruction.
This is a weird but legal thing to do.
Again, if L2 actually contained the instruction, it would have worked?
2. When an actual INTn/INTO/INT3 is intercepted on SVM,
save.RIP points to the address of the instruction, and control.next_rip
points to the address of the next instruction after it (as expected).
3. When EVENTINJ is used with '(TYPE = 3) with vectors 3 or 4'
or 'TYPE = 4', then next_rip is pushed on the stack, while save.RIP is
pretty much ignored, and execution jumps to the handler in the IDT.
Also, at least for INT3/INTO, the APM states that the IDT descriptor's DPL field
is checked before dispatch, meaning that we can get a legitimate #GP during delivery.
(This looks like another legitimate reason to fix exception merging in KVM.)
Best regards,
Maxim Levitsky
>
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@xxxxxxxxxx>
> ---
> arch/x86/kvm/svm/nested.c | 65 +++++++++++++++++++++++++++++++++++++--
> arch/x86/kvm/svm/svm.c | 17 ++++++++--
> arch/x86/kvm/svm/svm.h | 47 ++++++++++++++++++++++++++++
> 3 files changed, 125 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 9656f0d6815c..75017bf77955 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -420,8 +420,17 @@ void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
> void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
> {
> u32 mask;
> - svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
> - svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
> +
> + /*
> + * Leave the pending L1 -> L2 event in svm->nested.ctl.event_inj{,err}
> + * if its re-injection is needed
> + */
> + if (!exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> + svm->nested.ctl.event_inj_err)) {
> + WARN_ON_ONCE(svm->vmcb->control.event_inj & SVM_EVTINJ_VALID);
> + svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
> + svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
> + }
Beware that this could backfire in regard to nested migration.
I once chased a nasty bug related to this.
The bug was:
- L1 does VMRUN with injection (EVENTINJ set)
- VMRUN exit handler is 'executed' by KVM,
This copies the EVENTINJ fields from VMCB12 to VMCB02
- Once VMRUN exit handler is done executing, we exit to userspace to start migration
(basically static_call(kvm_x86_handle_exit)(...) handles the SVM_EXIT_VMRUN,
and that is all, vcpu_enter_guest isn't called again, so injection is not canceled)
- migration happens and it migrates the control area of vmcb02 with EVENTINJ fields set.
- on the migration target, we inject another interrupt to the guest via EVENTINJ
  (svm_check_nested_events checks nested_run_pending to avoid doing this,
  but nested_run_pending was not migrated correctly), which overwrites
  the EVENTINJ - the injection is lost.
Paolo back then proposed that instead of doing a direct copy from VMCB12 to VMCB02
we should instead go through 'vcpu->arch.interrupt' and such.
I had a prototype of this but never got around to cleaning it up for upstream,
knowing that the current way also works.
>
> /* Only a few fields of int_ctl are written by the processor. */
> mask = V_IRQ_MASK | V_TPR_MASK;
> @@ -669,6 +678,54 @@ static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to
> to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl;
> }
>
> +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu)
A personal taste note: I don't like the 'maybe' for some reason.
But I won't fight over this.
> +{
> + struct vcpu_svm *svm = to_svm(vcpu);
> + unsigned int vector, type;
> + u32 exitintinfo = svm->vmcb->control.exit_int_info;
> +
> + if (WARN_ON_ONCE(!is_guest_mode(vcpu)))
> + return;
> +
> + /*
> + * No L1 -> L2 event to re-inject?
> + *
> + * In this case event_inj will be cleared by
> + * nested_sync_control_from_vmcb02().
> + */
> + if (!(svm->nested.ctl.event_inj & SVM_EVTINJ_VALID))
> + return;
> +
> + /* If the last event injection was successful there shouldn't be any pending event */
> + if (WARN_ON_ONCE(!(exitintinfo & SVM_EXITINTINFO_VALID)))
> + return;
> +
> + kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
> + vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
> + type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
> +
> + switch (type) {
> + case SVM_EXITINTINFO_TYPE_NMI:
> + vcpu->arch.nmi_injected = true;
> + break;
> + case SVM_EXITINTINFO_TYPE_EXEPT:
> + if (exitintinfo & SVM_EXITINTINFO_VALID_ERR)
> + kvm_requeue_exception_e(vcpu, vector,
> + svm->vmcb->control.exit_int_info_err);
> + else
> + kvm_requeue_exception(vcpu, vector);
> + break;
> + case SVM_EXITINTINFO_TYPE_SOFT:
Note that AFAIK, SVM_EXITINTINFO_TYPE_SOFT is only for INTn instructions,
while INT3 and INTO are considered normal exceptions, but the EVENTINJ
handling has a special case for them.
On VMX it is a bit cleaner.
It has:
3 - normal exception raised by the CPU itself, like #PF and such
4 - INTn
    * does a DPL check and uses VM_EXIT_INSTRUCTION_LEN
5 - ICEBP/INT1, which SVM doesn't support re-injecting
    * doesn't do a DPL check, but uses VM_EXIT_INSTRUCTION_LEN I think
6 - software exception (INT3/INTO)
    * does a DPL check and uses VM_EXIT_INSTRUCTION_LEN as well
I don't know if there is any difference between 4 and 6.
> + case SVM_EXITINTINFO_TYPE_INTR:
> + kvm_queue_interrupt(vcpu, vector, type == SVM_EXITINTINFO_TYPE_SOFT);
> + break;
> + default:
> + vcpu_unimpl(vcpu, "unknown L1 -> L2 exitintinfo type 0x%x\n", type);
> + break;
> + }
> +}
> +
> int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
> struct vmcb *vmcb12, bool from_vmrun)
> {
> @@ -898,6 +955,10 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
> if (svm->nrips_enabled)
> vmcb12->control.next_rip = vmcb->control.next_rip;
>
> + /* Forget about any pending L1 event injection since it's a L1 worry now */
> + svm->nested.ctl.event_inj = 0;
> + svm->nested.ctl.event_inj_err = 0;
> +
> vmcb12->control.int_ctl = svm->nested.ctl.int_ctl;
> vmcb12->control.tlb_ctl = svm->nested.ctl.tlb_ctl;
> vmcb12->control.event_inj = svm->nested.ctl.event_inj;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 1e5d904aeec3..5b128baa5e57 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3322,13 +3322,18 @@ static void svm_inject_irq(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
>
> - WARN_ON(!gif_set(svm));
> + WARN_ON(!(vcpu->arch.interrupt.soft || gif_set(svm)));
>
> trace_kvm_inj_virq(vcpu->arch.interrupt.nr);
> ++vcpu->stat.irq_injections;
>
> svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
> - SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
> + SVM_EVTINJ_VALID;
> + if (vcpu->arch.interrupt.soft) {
> + svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_SOFT;
> + } else {
> + svm->vmcb->control.event_inj |= SVM_EVTINJ_TYPE_INTR;
> + }
> }
>
> void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,
> @@ -3627,6 +3632,14 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
> if (!(exitintinfo & SVM_EXITINTINFO_VALID))
> return;
>
> + /* L1 -> L2 event re-injection needs a different handling */
> + if (is_guest_mode(vcpu) &&
> + exit_during_event_injection(svm, svm->nested.ctl.event_inj,
> + svm->nested.ctl.event_inj_err)) {
> + nested_svm_maybe_reinject(vcpu);
> + return;
> + }
> +
> kvm_make_request(KVM_REQ_EVENT, vcpu);
>
> vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index f757400fc933..7cafc2e6c82a 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -488,6 +488,52 @@ static inline bool nested_npt_enabled(struct vcpu_svm *svm)
> return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
> }
>
> +static inline bool event_inj_same(u32 event_inj1, u32 event_inj_err1,
> + u32 event_inj2, u32 event_inj_err2)
> +{
> + unsigned int vector_1, vector_2, type_1, type_2;
> +
> + /* Either of them not valid? */
> + if (!(event_inj1 & SVM_EVTINJ_VALID) ||
> + !(event_inj2 & SVM_EVTINJ_VALID))
> + return false;
> +
> + vector_1 = event_inj1 & SVM_EVTINJ_VEC_MASK;
> + type_1 = event_inj1 & SVM_EVTINJ_TYPE_MASK;
> + vector_2 = event_inj2 & SVM_EVTINJ_VEC_MASK;
> + type_2 = event_inj2 & SVM_EVTINJ_TYPE_MASK;
> +
> + /* Different vector or type? */
> + if (vector_1 != vector_2 || type_1 != type_2)
> + return false;
> +
> + /* Different error code presence flag? */
> + if ((event_inj1 & SVM_EVTINJ_VALID_ERR) !=
> + (event_inj2 & SVM_EVTINJ_VALID_ERR))
> + return false;
> +
> + /* No error code? */
> + if (!(event_inj1 & SVM_EVTINJ_VALID_ERR))
> + return true;
> +
> + /* Same error code? */
> + return event_inj_err1 == event_inj_err2;
> +}
> +
> +/* Did the last VMEXIT happen when attempting to inject that event? */
> +static inline bool exit_during_event_injection(struct vcpu_svm *svm,
> + u32 event_inj, u32 event_inj_err)
> +{
> + BUILD_BUG_ON(SVM_EXITINTINFO_VEC_MASK != SVM_EVTINJ_VEC_MASK ||
> + SVM_EXITINTINFO_TYPE_MASK != SVM_EVTINJ_TYPE_MASK ||
> + SVM_EXITINTINFO_VALID != SVM_EVTINJ_VALID ||
> + SVM_EXITINTINFO_VALID_ERR != SVM_EVTINJ_VALID_ERR);
> +
> + return event_inj_same(svm->vmcb->control.exit_int_info,
> + svm->vmcb->control.exit_int_info_err,
> + event_inj, event_inj_err);
> +}
> +
> /* svm.c */
> #define MSR_INVALID 0xffffffffU
>
> @@ -540,6 +586,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
> return vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
> }
>
> +void nested_svm_maybe_reinject(struct kvm_vcpu *vcpu);
> int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
> u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
> void svm_leave_nested(struct kvm_vcpu *vcpu);
>
I will also review Sean's take on this; let's see which one is simpler.
Best regards,
Maxim Levitsky