Re: [PATCH] KVM: x86: Exit to userspace when kvm_check_nested_events fails

From: Sean Christopherson
Date: Wed Jul 28 2021 - 13:53:53 EST


On Wed, Jul 28, 2021, Paolo Bonzini wrote:
> On 28/07/21 14:39, Vitaly Kuznetsov wrote:
> > Shouldn't we also change kvm_arch_vcpu_runnable() and check
> > 'kvm_vcpu_running() > 0' now?
>
> I think leaving kvm_vcpu_block on error is the better choice, so it should
> be good with returning true if kvm_vcpu_running(vcpu) < 0.

Blech. This is all gross. There is a subtle bug lurking in both Jim's approach
and in this approach. It's not detected because the selftest exercises a bad PI
descriptor, not a bad vAPIC page.

In Jim's approach of returning 'true' from kvm_vcpu_running() if
kvm_check_nested_events() fails due to vmx_complete_nested_posted_interrupt()
detecting a bad vAPIC page, the resulting KVM_EXIT_INTERNAL_ERROR will be "lost"
due to vmx->nested.pi_pending being cleared. KVM runs the vCPU, but skips over
the PI check in inject_pending_event() due to vmx->nested.pi_pending==false.
The selftest works because the bad PI descriptor case is handled _before_
pi_pending is cleared.

This approach mostly fixes that bug by virtue of returning immediately in the
vcpu_run() case, but if the bad vAPIC page is encountered via
kvm_arch_vcpu_runnable(), KVM will effectively drop the error. This can be
hack-a-fixed by pre-checking the vAPIC page (diff at the bottom). That's
arguably architecturally wrong, as the virtual APIC page access shouldn't
occur until PI.ON has been cleared,
but from KVM's perspective I think it's the least awful "fix" given the current
train wreck.
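
FWIW, Paolo's suggestion above would presumably look something like this
(untested sketch, assuming kvm_vcpu_running() is converted to return int with
a negative value on error):

	int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
	{
		/* Treat errors as "runnable" so that kvm_vcpu_block() bails
		 * and the error can be surfaced on the next KVM_RUN. */
		return kvm_vcpu_running(vcpu) != 0 || kvm_vcpu_has_events(vcpu);
	}

but that only helps if the error condition stays set, e.g. pi_pending isn't
cleared, so that the next kvm_check_nested_events() call fails again;
otherwise the runnable path still eats the error.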

Alternatively, what about punting on all of this in favor of targeting the full
cleanup[*] for 5.15? I believe I have the bandwidth to pick that up.

[*] https://lkml.kernel.org/r/YKWI1GPdNc4shaCt@xxxxxxxxxx

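For reference, the vAPIC page pre-check hack-a-fix described above:
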
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0d0dd6580cfd..8d1c8217954a 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3707,6 +3707,10 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
 	if (!vmx->nested.pi_desc)
 		goto mmio_needed;
 
+	vapic_page = vmx->nested.virtual_apic_map.hva;
+	if (!vapic_page)
+		goto mmio_needed;
+
 	vmx->nested.pi_pending = false;
 
 	if (!pi_test_and_clear_on(vmx->nested.pi_desc))
@@ -3714,10 +3718,6 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
 
 	max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
 	if (max_irr != 256) {
-		vapic_page = vmx->nested.virtual_apic_map.hva;
-		if (!vapic_page)
-			goto mmio_needed;
-
 		__kvm_apic_update_irr(vmx->nested.pi_desc->pir,
 				      vapic_page, &max_irr);
 		status = vmcs_read16(GUEST_INTR_STATUS);
