Re: [PATCH] KVM/x86: Do not clear SIPI while in SMM
From: Igor Mammedov
Date: Tue Oct 01 2024 - 04:18:44 EST
On Mon, 30 Sep 2024 16:34:57 -0700
Eric Mackay <eric.mackay@xxxxxxxxxx> wrote:
> > On Thu, 26 Sep 2024 18:22:39 -0700
> > Eric Mackay <eric.mackay@xxxxxxxxxx> wrote:
> > > > On 9/24/24 5:40 AM, Igor Mammedov wrote:
> > > >> On Fri, 19 Apr 2024 12:17:01 -0400
> > > >> boris.ostrovsky@xxxxxxxxxx wrote:
> > > >>
> > > >>> On 4/17/24 9:58 AM, boris.ostrovsky@xxxxxxxxxx wrote:
> > > >>>>
> > > >>>> I noticed that I was using a few months old qemu bits, and now I am
> > > >>>> having trouble reproducing this on the latest bits. Let me see if I
> > > >>>> can get this to fail with the latest first and then try to trace why
> > > >>>> the processor is in this unexpected state.
> > > >>>
> > > >>> Looks like 012b170173bc "system/qdev-monitor: move drain_call_rcu call
> > > >>> under if (!dev) in qmp_device_add()" is what makes the test stop failing.
> > > >>>
> > > >>> I need to understand whether the lack of failures is a side effect of
> > > >>> timing changes that simply make hotplug failures less likely, or if
> > > >>> this is an actual (but seemingly unintentional) fix.
> > > >>
> > > >> Agreed, we should find the culprit of the problem.
> > > >
> > > >
> > > > I haven't been able to spend much time on this unfortunately; Eric is
> > > > now starting to look at this again.
> > > >
> > > > One of my theories was that ich9_apm_ctrl_changed() is sending SMIs to
> > > > vCPUs serially, while on HW my understanding is that this is done as a
> > > > broadcast, so I thought this could cause a race. I did a quick test
> > > > with pausing and resuming all vCPUs around the loop, but that didn't
> > > > help.
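
For reference, the delivery loop in question looks roughly like the sketch
below. This is from memory rather than the exact QEMU code (and the function
name is made up), but the point stands: even the "broadcast" case is a serial
cpu_interrupt() loop over the vCPUs, whereas real HW asserts the SMI pin on
all CPUs at the same time.

/*
 * Sketch only: roughly what the APM control write path does when SMIs are
 * enabled. It assumes QEMU's CPU_FOREACH()/cpu_interrupt() helpers and the
 * i386 CPU_INTERRUPT_SMI flag; the exact headers differ between QEMU trees.
 */
static void apm_smi_broadcast_sketch(void)
{
    CPUState *cs;

    /* Raise SMI on each vCPU in turn: earlier vCPUs may already be running
     * their SMI handler before later ones have even seen the request.
     */
    CPU_FOREACH(cs) {
        cpu_interrupt(cs, CPU_INTERRUPT_SMI);
    }
}
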
> > > >
> > > >
> > > >>
> > > >> PS:
> > > >> also if you are using an AMD host, there was a regression in OVMF
> > > >> where a vCPU that OSPM was already online-ing was yanked out from
> > > >> under OSPM's feet by OVMF (which, depending on timing, could
> > > >> manifest as a lost SIPI).
> > > >>
> > > >> edk2 commit that should fix it is:
> > > >> https://github.com/tianocore/edk2/commit/1c19ccd5103b
> > > >>
> > > >> Switching to an Intel host should rule that out at least
> > > >> (or use the fixed edk2-ovmf-20240524-5.el10.noarch package from CentOS
> > > >> if you are forced to use an AMD host).
> > >
> > > I haven't been able to reproduce the issue on an Intel host thus far,
> > > but it may not be an apples-to-apples comparison because my AMD hosts
> > > have a much higher core count.
> > >
> > > >
> > > > I just tried with the latest bits that include this commit and was
> > > > still able to reproduce the problem.
> > > >
> > > >
> > > >-boris
> > >
> > > The initial hotplug of each CPU appears to complete from the
> > > perspective of OVMF and OSPM. SMBASE relocation succeeds, and the new
> > > CPU reports back from the pen. It seems to be the later INIT-SIPI-SIPI
> > > sequence sent from the guest that doesn't complete.
> > >
> > > My working theory has been that some CPU/AP is lagging behind the others
> > > while the BSP is waiting for all the APs to go into SMM, and the BSP just
> > > gives up and moves on. Presumably the INIT-SIPI-SIPI is sent while that
> > > lagging CPU is finally entering SMM and the other CPUs are back in
> > > normal mode.
> > >
> > > I've been able to observe that the SMI handler for the problematic CPU
> > > will sometimes start running when no BSP is elected. This means we have a
> > > window of time where that CPU will ignore SIPI, and at least 1 CPU is in
> > > normal mode (the BSP) and capable of sending INIT-SIPI-SIPI from
> > > the guest.
> >
> > I've re-read the whole thread and noticed Boris was saying:
> > > On Tue, Apr 16, 2024 at 10:57 PM <boris.ostrovsky@xxxxxxxxxx> wrote:
> > > > On 4/16/24 4:53 PM, Paolo Bonzini wrote:
> > ...
> > > > >
> > > > > What is the reproducer for this?
> > > >
> > > > Hotplugging/unplugging cpus in a loop, especially if you oversubscribe
> > > > the guest, will get you there in 10-15 minutes.
> > ...
> >
> > So there was unplug involved as well, which has been broken since forever.
> >
> > A recent patch
> > https://patchew.org/QEMU/20230427211013.2994127-1-alxndr@xxxxxx/20230427211013.2994127-2-alxndr@xxxxxx/
> > has exposed an issue (an unexpected plug/unplug flow) whose root cause is
> > in OVMF. The firmware was letting uninvolved APs run wild in normal mode.
> > As a result, the AP that was calling _EJ0 and holding the ACPI lock kept
> > executing _EJ0 and released the ACPI lock while the BSP and the CPU being
> > removed were still in the SMM world. Any other plug/unplug op could then
> > grab the ACPI lock and trigger another SMI, which breaks the hotplug flow
> > expectations (i.e. exclusive access to the hotplug registers during a
> > plug/unplug op).
> > Perhaps that's what you are observing.
> >
> > Please check if following helps:
> > https://github.com/kraxel/edk2/commit/738c09f6b5ab87be48d754e62deb72b767415158
> >
>
> I haven't actually seen the guest crash during unplug, though certainly
> there have been unplug failures. I haven't been keeping track of the
> unplug failures as closely, but a test I ran over the weekend with this
> patch added seemed to show fewer unplug failures.
It's not only about unplug, unfortunately.
A QEMU that includes Alexander's patch essentially denies access to the
hotplug registers while an unplug is in progress. So if a hotplug is going
on at the same time, it may be broken by that denied access.
To exclude this issue, you need to test with the edk2 fix or use an older
QEMU without Alexander's patch.
> I'm still getting hotplug failures that cause a guest crash though, so
> that mystery remains.
>
> > So yes, SIPI can be lost (which should be expected, as others noted),
> > but that normally shouldn't be an issue as wakeup_secondary_cpu_via_init()
> > does resend SIPI.
> > However, if wakeup_secondary_cpu is set to another handler that doesn't
> > resend SIPI, it might be an issue.
>
> We're using wakeup_secondary_cpu_via_init(). acpi_wakeup_cpu() and
> wakeup_cpu_via_vmgexit(), for example, are a bit opaque to me, so I'm
> not sure if those code paths include a SIPI resend.
wakeup_secondary_cpu_via_init() should re-send SIPI.
If you can reproduce with KVM tracing and guest kernel debugging enabled,
I'd try that and check whether SIPIs are being re-sent or not.
That at least should give a hint as to whether we should look at the guest
side or at KVM/QEMU.
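
For completeness, the "re-send" I mean is the second STARTUP IPI in the
guest's wakeup path. Below is a condensed, from-memory sketch of what
wakeup_secondary_cpu_via_init() in arch/x86/kernel/smpboot.c does (not the
exact kernel code; error handling, delays and the pr_debug output are
omitted, and it assumes the <asm/apic.h> helpers and APIC_* constants):

/* Sketch of the INIT + 2x STARTUP sequence; the second STARTUP is the
 * resend that should cover a SIPI lost while the AP was still in SMM.
 */
static void wakeup_via_init_sketch(u32 phys_apicid, unsigned long start_eip)
{
	int j;

	/* Assert and then de-assert INIT on the target AP */
	apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
		       phys_apicid);
	safe_apic_wait_icr_idle();
	apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
	safe_apic_wait_icr_idle();

	/* Two STARTUP IPIs pointing at the trampoline */
	for (j = 1; j <= 2; j++) {
		apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12),
			       phys_apicid);
		safe_apic_wait_icr_idle();
	}
}

On the host side, the kvm_apic_ipi and kvm_apic_accept_irq tracepoints
(if I recall the names correctly) should show whether both STARTUPs reach
KVM's local APIC emulation and whether the second one is delivered or
dropped.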