Re: There are not enough CPU0 APIC IRQs while doing IRQ migration during S3

From: Kai-Heng Feng
Date: Wed Jul 20 2022 - 23:02:29 EST


On Wed, Jul 20, 2022 at 6:11 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> On Wed, Jul 20, 2022 at 5:16 AM Kai-Heng Feng
> <kai.heng.feng@xxxxxxxxxxxxx> wrote:
> >
> > [+Cc Rafael, linux-pm]
> >
> > On Wed, Jul 20, 2022 at 10:53 AM AceLan Kao <acelan.kao@xxxxxxxxxxxxx> wrote:
> > >
> > > Marc Zyngier <maz@xxxxxxxxxx> 於 2022年7月19日 週二 下午6:48寫道:
> > > >
> > > > [- Jason]
> > > >
> > > > On Tue, 19 Jul 2022 06:55:21 +0100,
> > > > AceLan Kao <acelan.kao@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > HI all,
> > > > >
> > > > > I encountered an issue while doing S3, it shows below message and then
> > > > > failed to enter S3
> > > > > [ 106.731140] CPU 31 has 116 vectors, 85 available. Cannot disable CPU
> > > > > [ 106.731551] ACPI: \_PR_.C01F: Found 2 idle states
> > > > > [ 106.732610] Error taking CPU31 down: -28
> > > > > [ 106.732612] Non-boot CPUs are not disabled
> > > > >
> > > > > CPU: AMD Ryzen Threadripper PRO 3955WX 16-Cores
> > > > > Kernel: v5.19-rc7
> > > > > There are 5 PCI to 4 type-c ports USB cards on the machine, and It
> > > > > wouldn't lead to the issue if only 4 cards are plugged. So, it looks
> > > > > like it can't handle 5 cards, and failed on the IRQ migration.
> > > > >
> > > > > The workaround provided by kaiheng is to release the irq while
> > > > > suspending and request irq while resuming.
> > > > > I'm wondering do we have a better solution for this kind of issue?
> > > > > Thanks.
> > > > >
> > > > > diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> > > > > index edc6881c8a1b..91c79b21cb57 100644
> > > > > --- a/drivers/usb/host/xhci.c
> > > > > +++ b/drivers/usb/host/xhci.c
> > > > > @@ -17,6 +17,7 @@
> > > > > #include <linux/slab.h>
> > > > > #include <linux/dmi.h>
> > > > > #include <linux/dma-mapping.h>
> > > > > +#include <linux/suspend.h>
> > > > >
> > > > > #include "xhci.h"
> > > > > #include "xhci-trace.h"
> > > > > @@ -1079,6 +1080,9 @@ int xhci_suspend(struct xhci_hcd *xhci, bool do_wakeup)
> > > > > __func__);
> > > > > }
> > > > >
> > > > > + if (pm_suspend_via_firmware())
> > > > > + xhci_cleanup_msix(xhci);
> > > >
> > > > I'm a bit clueless when it comes to the combination of x86 and xhci,
> > > > but doesn't this prevent resuming on a xhci interrupt?
> > > The PCI cards provide 4 type-c USB ports, and in the beginning we
> > > found that removing one PCI card fixed the issue, so we were trying to
> > > fix the issue in xhci driver.
> > > The USB ports on the PCI cards can't resume the system from S3 even
> > > without the workaround,
> > > but the USB ports on the rear panel of the motherboard still work with
> > > the workaround.
> >
> > The isn't xHCI specific. The issue here is that CPU0 APIC doesn't have
> > enough IRQ vector for ACPI S3 suspend.
> > Ideally we don't want to tear down IRQs during suspend, but for this
> > case minimizing IRQ numbers means successful S3.
> >
> > So maybe we can have a suspend flow like this:
> > - At the beginning of suspend, check if there's enough free IRQ for
> > CPU0 migration.
> > - If there isn't enough free slots, hint drivers to tear down non-wake
> > IRQs. Maybe use a global variable if we don't want to add a new
> > parameter to current PM ops.
> > - If it's still not enough, abort suspend.
> >
> > For suspend that doesn't unplug CPU like suspend-to-idle, no
> > modification is needed.
> > I wonder if that makes sense?
>
> Quite probably, IRQs need not be migrated during system suspend, so it
> should be possible to avoid doing that entirely on "hot remove" if it
> is part of the suspend flow.

So is it required to disable all CPUs except CPU0 for ACPI S3?
ACPI spec rev 6.3, "7.4.2.4 System \_S3 State" says "The processors
are not executing instructions. The processor-complex context is not
maintained." it doesn't say the CPUs have to be disabled.
So as long as CPUs are quiesced, is pm_sleep_disable_secondary_cpus()
still needed for S3?

Kai-Heng