Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume

From: Chao Gao
Date: Thu Oct 10 2024 - 07:00:45 EST


On Wed, Oct 02, 2024 at 10:20:11AM -0700, Sean Christopherson wrote:
>On Wed, Oct 02, 2024, Sean Christopherson wrote:
>> On Wed, Oct 02, 2024, Sean Christopherson wrote:
>> > On Wed, Oct 02, 2024, Markku Ahvenjärvi wrote:
>> > > Hi Sean,
>> > >
>> > > > On Fri, Sep 20, 2024, Markku Ahvenjärvi wrote:
>> > > > > Running certain hypervisors under KVM on VMX suffered L1 hangs after
>> > > > > launching a nested guest. The external interrupts were not processed on
>> > > > > vmlaunch/vmresume due to stale VPPR, and L2 guest would resume without
>> > > > > allowing L1 hypervisor to process the events.
>> > > > >
>> > > > > The patch ensures that VPPR is updated when checking for pending
>> > > > > interrupts.
>> > > >
>> > > > This is architecturally incorrect, PPR isn't refreshed at VM-Enter.
>> > >
>> > > I looked into this and found the following from Intel manual:
>> > >
>> > > "30.1.3 PPR Virtualization
>> > >
>> > > The processor performs PPR virtualization in response to the following
>> > > operations: (1) VM entry; (2) TPR virtualization; and (3) EOI virtualization.
>> > >
>> > > ..."
>> > >
>> > > The section "27.3.2.5 Updating Non-Register State" further explains the VM
>> > > enter:
>> > >
>> > > "If the “virtual-interrupt delivery” VM-execution control is 1, VM entry loads
>> > > the values of RVI and SVI from the guest interrupt-status field in the VMCS
>> > > (see Section 25.4.2). After doing so, the logical processor first causes PPR
>> > > virtualization (Section 30.1.3) and then evaluates pending virtual interrupts
>> > > (Section 30.2.1). If a virtual interrupt is recognized, it may be delivered in
>> > > VMX non-root operation immediately after VM entry (including any specified
>> > > event injection) completes; ..."
>> > >
>> > > According to that, PPR is supposed to be refreshed at VM-Enter, or am I
>> > > missing something here?
>> >
>> > Huh, I missed that. It makes sense I guess; VM-Enter processes pending virtual
>> > interrupts, so it stands to reason that VM-Enter would refresh PPR as well.
>> >
>> > Ugh, and looking again, KVM refreshes PPR every time it checks for a pending
>> > interrupt, including the VM-Enter case (via kvm_apic_has_interrupt()) when nested
>> > posted interrupts are in use:
>> >
>> > 	/* Emulate processing of posted interrupts on VM-Enter. */
>> > 	if (nested_cpu_has_posted_intr(vmcs12) &&
>> > 	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
>> > 		vmx->nested.pi_pending = true;
>> > 		kvm_make_request(KVM_REQ_EVENT, vcpu);
>> > 		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
>> > 	}
>> >
>> > I'm still curious as to what's different about your setup, but certainly not
>> > curious enough to hold up a fix.
>>
>> Actually, none of the above is even relevant. PPR virtualization in the nested
>> VM-Enter case would be for _L2's_ vPPR, not L1's. And that virtualization is
>> performed by hardware (vmcs02 has the correct RVI, SVI, and vAPIC information
>> for L2).
>>
>> Which means my initial instinct that KVM is missing a PPR update somewhere is
>> likely correct.
>
>Talking to myself :-)
>
>Assuming it actually fixes your issue, this is what I'm planning on posting. I
>suspect KVM botches something when the deprivileged host is active, but given
>that the below will allow for additional cleanups, and practically speaking doesn't
>have any downsides, I don't see any reason to withhold the hack-a-fix. Though
>hopefully we'll someday figure out exactly what's broken.

The issue is that KVM does not properly update vmcs01's SVI. In this case, L1
does not intercept EOI MSR writes from the deprivileged host (L2), so KVM
emulates EOI writes by clearing the highest bit in vISR and updating vPPR.
However, SVI in vmcs01 is not updated, so it retains the interrupt vector that
was just EOI'd. On the next VM-entry to L1, the CPU performs PPR
virtualization, setting vPPR to SVI & 0xf0, which leaves a stale, too-high
vPPR that blocks delivery of L1's pending interrupts.

Can you try this fix?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a93ac1b9be9..3d24194a648d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -122,6 +122,8 @@
 #define KVM_REQ_HV_TLB_FLUSH \
 	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
+#define KVM_REQ_UPDATE_HWAPIC_ISR \
+	KVM_ARCH_REQ_FLAGS(35, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)

 #define CR0_RESERVED_BITS \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -764,6 +766,7 @@ struct kvm_vcpu_arch {
 	u64 apic_base;
 	struct kvm_lapic *apic; /* kernel irqchip context */
 	bool load_eoi_exitmap_pending;
+	bool update_hwapic_isr;
 	DECLARE_BITMAP(ioapic_handled_vectors, 256);
 	unsigned long apic_attention;
 	int32_t apic_arb_prio;
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index b1eb46e26b2e..a8dad16161e4 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -220,6 +220,11 @@ static inline void leave_guest_mode(struct kvm_vcpu *vcpu)
 		kvm_make_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu);
 	}
 
+	if (vcpu->arch.update_hwapic_isr) {
+		vcpu->arch.update_hwapic_isr = false;
+		kvm_make_request(KVM_REQ_UPDATE_HWAPIC_ISR, vcpu);
+	}
+
 	vcpu->stat.guest_mode = 0;
 }

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5bb481aefcbc..d6a03c30f085 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -800,6 +800,9 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
 	if (!__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
 		return;
 
+	if (is_guest_mode(apic->vcpu))
+		apic->vcpu->arch.update_hwapic_isr = true;
+
 	/*
 	 * We do get here for APIC virtualization enabled if the guest
 	 * uses the Hyper-V APIC enlightenment. In this case we may need
@@ -3068,6 +3071,14 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
 	return 0;
 }
 
+void kvm_vcpu_update_hwapic_isr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic->apicv_active)
+		kvm_x86_call(hwapic_isr_update)(apic_find_highest_isr(apic));
+}
+
 void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
 {
 	struct hrtimer *timer;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 7ef8ae73e82d..ffa0c0e8bda9 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -266,6 +266,7 @@ void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
 bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu);
 void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu);
 bool kvm_can_use_hv_timer(struct kvm_vcpu *vcpu);
+void kvm_vcpu_update_hwapic_isr(struct kvm_vcpu *vcpu);
 
 static inline enum lapic_mode kvm_apic_mode(u64 apic_base)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 34b52b49f5e6..d90add3fbe99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10968,6 +10968,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 #endif
 	if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
 		kvm_vcpu_update_apicv(vcpu);
+	if (kvm_check_request(KVM_REQ_UPDATE_HWAPIC_ISR, vcpu))
+		kvm_vcpu_update_hwapic_isr(vcpu);
 	if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
 		kvm_check_async_pf_completion(vcpu);
 	if (kvm_check_request(KVM_REQ_MSR_FILTER_CHANGED, vcpu))

>
>---
>From: Sean Christopherson <seanjc@xxxxxxxxxx>
>Date: Wed, 2 Oct 2024 08:53:23 -0700
>Subject: [PATCH] KVM: nVMX: Explicitly update vPPR on successful nested
> VM-Enter
>MIME-Version: 1.0
>Content-Type: text/plain; charset=UTF-8
>Content-Transfer-Encoding: 8bit
>
>Request pending event evaluation after successful nested VM-Enter if L1
>has a pending IRQ, which in turn refreshes vPPR based on vTPR and vISR.
>This fixes an issue where KVM will fail to deliver a pending IRQ to L1
>when running an atypical hypervisor in L1, e.g. the pKVM port to VMX.
>
>Odds are very good that refreshing L1's vPPR is papering over a missed
>PPR update somewhere, e.g. the underlying bug is likely related to the fact
>that pKVM passes through its APIC to the deprivileged host (which is an
>L2 guest from KVM's perspective).
>
>However, KVM updates PPR _constantly_, even when PPR technically shouldn't
>be refreshed, e.g. kvm_vcpu_has_events() re-evaluates PPR if IRQs are
>unblocked, by way of the same kvm_apic_has_interrupt() check. Ditto for
>nested VM-Enter itself, when nested posted interrupts are enabled. Thus,
>trying to avoid a PPR update on VM-Enter just to be pedantically accurate
>is ridiculous, given the behavior elsewhere in KVM.
>
>Unconditionally checking for interrupts will also allow for additional
>cleanups, e.g. the manual RVI check earlier in VM-Enter emulation by
>vmx_has_apicv_interrupt() can be dropped, and the aforementioned nested
>posted interrupt logic can be massaged to better honor priority between
>concurrent events.
>
>Link: https://lore.kernel.org/kvm/20230312180048.1778187-1-jason.cj.chen@xxxxxxxxx
>Reported-by: Markku Ahvenjärvi <mankku@xxxxxxxxx>
>Closes: https://lore.kernel.org/all/20240920080012.74405-1-mankku@xxxxxxxxx
>Suggested-by: Markku Ahvenjärvi <mankku@xxxxxxxxx>
>Cc: Janne Karhunen <janne.karhunen@xxxxxxxxx>
>Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
>---
> arch/x86/kvm/vmx/nested.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>index a8e7bc04d9bf..784b61c9810b 100644
>--- a/arch/x86/kvm/vmx/nested.c
>+++ b/arch/x86/kvm/vmx/nested.c
>@@ -3593,7 +3593,8 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> 	 * effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
> 	 * unconditionally.
> 	 */
>-	if (unlikely(evaluate_pending_interrupts))
>+	if (unlikely(evaluate_pending_interrupts) ||
>+	    kvm_apic_has_interrupt(vcpu))
> 		kvm_make_request(KVM_REQ_EVENT, vcpu);
>
> /*
>
>base-commit: e32cde8d2bd7d251a8f9b434143977ddf13dcec6
>--
>