Re: [PATCH] KVM: x86: Fix lapic timer interrupt lost after loading a snapshot.

From: Sean Christopherson
Date: Tue Sep 12 2023 - 13:07:29 EST


On Tue, Sep 12, 2023, Haitao Shan wrote:
> This issue exists in kernel version 6.3-rc1 or above. The issue is
> introduced by the commit 8e6ed96cdd50 ("KVM: x86: fire timer when it is
> migrated and expired, and in oneshot mode"). The issue occurs on Intel
> platform which APIC virtualization and posted interrupt processing.

I think the bug was actually introduced by:

967235d32032 ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC")

Fixing the "deadline <= 0" handling just made it much easier to be hit. E.g. if
the deadline was '1' during restore, set_target_expiration() would set tscdeadline
to T1+(1*N), where T1 is the current TSC and N is the multipler to get from nanoseconds
to cycles. start_sw_tscdeadline() (or vmx_set_hv_timer()) would then reread the
TSC (call it T2), see T2 > T1+(1*N), and mark the timer as expired.

> The issue is first discovered when running the Android Emulator which
> is based on QEMU 2.12. I can reproduce the issue with QEMU 8.0.4 in
> Debian 12.

The above is helpful as extra context, but repeating "This issue" and "The issue"
over and over without ever actually describing what the issue actualy is makes it
quite difficult to understand what is actually being fixed.

> With the above commit, the timer gets fired immediately inside the
> KVM_SET_LAPIC call when loading the snapshot. On such Intel platforms,
> this eventually leads to setting the corresponding PIR bit. However,
> the whole PIR bits get cleared later in the same KVM_SET_LAPIC call.
> Missing lapic timer interrupt freeze the guest kernel.

Please phrase changelogs as commands and state what is actually being changed.
Again, the context on what is broken is helpful, but the changelog really, really
needs to state what is being changed.

> Signed-off-by: Haitao Shan <hshan@xxxxxxxxxx>
> ---
> arch/x86/kvm/lapic.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index a983a16163b1..6f73406b875a 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2977,14 +2977,14 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
> apic_update_lvtt(apic);
> apic_manage_nmi_watchdog(apic, kvm_lapic_get_reg(apic, APIC_LVT0));
> update_divide_count(apic);
> - __start_apic_timer(apic, APIC_TMCCT);
> - kvm_lapic_set_reg(apic, APIC_TMCCT, 0);
> kvm_apic_update_apicv(vcpu);
> if (apic->apicv_active) {
> static_call_cond(kvm_x86_apicv_post_state_restore)(vcpu);
> static_call_cond(kvm_x86_hwapic_irr_update)(vcpu, apic_find_highest_irr(apic));
> static_call_cond(kvm_x86_hwapic_isr_update)(apic_find_highest_isr(apic));
> }
> + __start_apic_timer(apic, APIC_TMCCT);
> + kvm_lapic_set_reg(apic, APIC_TMCCT, 0);

I don't think this is the ordering we want. It currently works, but it subtly
"relies" on a few things:

1. That vmx_deliver_posted_interrupt() never "fails" when APICv is enabled,
i.e. never puts the interrupt in the IRR instead of the PIR.

2. The SVM, a.k.a. AVIC, doesn't ever sync from the IRR to a separate "hardware"
virtual APIC, because unlike VMX, SVM does set the bit in the IRR.

I doubt #2 will ever change simply because that's tied to how AVIC works, and #1
shouldn't actually break anything since the fallback path in vmx_deliver_interrupt()
needs to be self-sufficient, but I don't like that the code syncs from the IRR and
_then_ potentially modifies the IRR.

I also don't like doing additional APIC state restoration _after_ invoking the
post_state_restore() hook. Updating APICv in the middle of the restore flow is
going to be brittle and difficult to maintain, e.g. it won't be obvious what
needs to go before and what needs to go after.

IMO, vmx_apicv_post_state_restore() is blatantly broken. It is most definitely
not doing "post state restore" stuff, it's simply purging state, i.e. belongs in
a "pre state restore" hook.

So rather than shuffle around the timer code, I think we should instead add yet
another kvm_x86_ops hook, e.g. apicv_pre_state_restore(), and have initialize
the PI descriptor there.

Aha! And I think the new apicv_pre_state_restore() needs to be invoked even if
APICv is not active, because I don't see anything that purges the PIR when APICv
is enabled. VMX's APICv doesn't have many inhibits that can go away, and I
highly doubt userspace will restore into a vCPU with pending posted interrupts,
so in practice this is _extremely_ unlikely to be problematic. But it's still
wrong.