Re: [PATCH] KVM: x86: Fix lapic timer interrupt lost after loading a snapshot.

From: Haitao Shan
Date: Tue Sep 12 2023 - 14:16:21 EST


Sorry for sending the reply twice. The former reply got rejected since
I forgot to turn on the plain text email setting.

On Tue, Sep 12, 2023 at 10:07 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Tue, Sep 12, 2023, Haitao Shan wrote:
> > This issue exists in kernel version 6.3-rc1 or above. The issue is
> > introduced by the commit 8e6ed96cdd50 ("KVM: x86: fire timer when it is
> > migrated and expired, and in oneshot mode"). The issue occurs on Intel
> > platform which APIC virtualization and posted interrupt processing.
>
> I think the bug was actually introduced by:
>
> 967235d32032 ("KVM: vmx: clear pending interrupts on KVM_SET_LAPIC")
Thanks for pointing this out. I know commit 8e6ed96cdd50 is only a
trigger. But I
did not go one step further and find out where the bug is coming from.
>
> Fixing the "deadline <= 0" handling just made it much easier to be hit. E.g. if
> the deadline was '1' during restore, set_target_expiration() would set tscdeadline
> to T1+(1*N), where T1 is the current TSC and N is the multipler to get from nanoseconds
> to cycles. start_sw_tscdeadline() (or vmx_set_hv_timer()) would then reread the
> TSC (call it T2), see T2 > T1+(1*N), and mark the timer as expired.
>
> > The issue is first discovered when running the Android Emulator which
> > is based on QEMU 2.12. I can reproduce the issue with QEMU 8.0.4 in
> > Debian 12.
>
> The above is helpful as extra context, but repeating "This issue" and "The issue"
> over and over without ever actually describing what the issue actualy is makes it
> quite difficult to understand what is actually being fixed.
Got it. I will rewrite the whole commit message for v2.
>
> > With the above commit, the timer gets fired immediately inside the
> > KVM_SET_LAPIC call when loading the snapshot. On such Intel platforms,
> > this eventually leads to setting the corresponding PIR bit. However,
> > the whole PIR bits get cleared later in the same KVM_SET_LAPIC call.
> > Missing lapic timer interrupt freeze the guest kernel.
>
> Please phrase changelogs as commands and state what is actually being changed.
> Again, the context on what is broken is helpful, but the changelog really, really
> needs to state what is being changed.
Will do.
>
> > Signed-off-by: Haitao Shan <hshan@xxxxxxxxxx>
> > ---
> > arch/x86/kvm/lapic.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index a983a16163b1..6f73406b875a 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2977,14 +2977,14 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
> > apic_update_lvtt(apic);
> > apic_manage_nmi_watchdog(apic, kvm_lapic_get_reg(apic, APIC_LVT0));
> > update_divide_count(apic);
> > - __start_apic_timer(apic, APIC_TMCCT);
> > - kvm_lapic_set_reg(apic, APIC_TMCCT, 0);
> > kvm_apic_update_apicv(vcpu);
> > if (apic->apicv_active) {
> > static_call_cond(kvm_x86_apicv_post_state_restore)(vcpu);
> > static_call_cond(kvm_x86_hwapic_irr_update)(vcpu, apic_find_highest_irr(apic));
> > static_call_cond(kvm_x86_hwapic_isr_update)(apic_find_highest_isr(apic));
> > }
> > + __start_apic_timer(apic, APIC_TMCCT);
> > + kvm_lapic_set_reg(apic, APIC_TMCCT, 0);
>
> I don't think this is the ordering we want. It currently works, but it subtly
> "relies" on a few things:
>
> 1. That vmx_deliver_posted_interrupt() never "fails" when APICv is enabled,
> i.e. never puts the interrupt in the IRR instead of the PIR.
>
> 2. The SVM, a.k.a. AVIC, doesn't ever sync from the IRR to a separate "hardware"
> virtual APIC, because unlike VMX, SVM does set the bit in the IRR.
>
> I doubt #2 will ever change simply because that's tied to how AVIC works, and #1
> shouldn't actually break anything since the fallback path in vmx_deliver_interrupt()
> needs to be self-sufficient, but I don't like that the code syncs from the IRR and
> _then_ potentially modifies the IRR.
>
> I also don't like doing additional APIC state restoration _after_ invoking the
> post_state_restore() hook. Updating APICv in the middle of the restore flow is
> going to be brittle and difficult to maintain, e.g. it won't be obvious what
> needs to go before and what needs to go after.
>
> IMO, vmx_apicv_post_state_restore() is blatantly broken. It is most definitely
> not doing "post state restore" stuff, it's simply purging state, i.e. belongs in
> a "pre state restore" hook.
>
> So rather than shuffle around the timer code, I think we should instead add yet
> another kvm_x86_ops hook, e.g. apicv_pre_state_restore(), and have initialize
> the PI descriptor there.
>
> Aha! And I think the new apicv_pre_state_restore() needs to be invoked even if
> APICv is not active, because I don't see anything that purges the PIR when APICv
> is enabled. VMX's APICv doesn't have many inhibits that can go away, and I
> highly doubt userspace will restore into a vCPU with pending posted interrupts,
> so in practice this is _extremely_ unlikely to be problematic. But it's still
> wrong.
Thanks for sharing what you would like to fix the bug. I will write a
v2 for that.
Actually, I am sorry that I forgot to add RFC to the title, as I
personally did not think
the proposed fix looks clean. I am surprised that
apic_post_state_restore actually
clears the whole PIR which looks like "resetting" instead of
"restoring". However,
I am not sure whether this is the exact intended behavior and code sequence. If
it is intended, __restart_apic_timer should defer its interrupt
delivery action after
apic_post_state_restore (something like raising a request for updating PIR when
vcpu_enter_guest).

I will work on v2 following your suggestion. Thanks!

--
Haitao @Google