Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling

From: Alex Williamson
Date: Tue Jan 23 2024 - 11:51:23 EST


On Tue, 23 Jan 2024 17:12:39 +0100
Lukas Wunner <lukas@xxxxxxxxx> wrote:

> On Tue, Jan 23, 2024 at 08:55:21AM -0700, Alex Williamson wrote:
> > On Tue, 23 Jan 2024 11:45:19 +0100 Lukas Wunner <lukas@xxxxxxxxx> wrote:
> > > If the device is RPM_SUSPENDING, why immediately resume it for polling?
> > > It's sufficient to poll it the next time around, i.e. 1 second later.
> > >
> > > Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need
> > > to poll PME.
> >
> > I'm clearly not an expert on PME, but this is not obvious to me and
> > before the commit that went in through this thread, PME wakeup was
> > triggered regardless of the PM state. I was trying to restore the
> > behavior of not requiring a specific PM state other than deferring
> > polling across transition states.
>
> There are broken devices which are incapable of signaling PME.
> As a workaround, the kernel polls these devices once per second.
> The first time the device signals PME, the kernel stops polling
> that particular device because PME is clearly working.
>
> So this is just a best-effort way to support PME for broken devices.
> If it takes a little longer to detect that PME was signaled, it's not
> a big deal.
>
> > The issue I'm trying to address is that config space of the device can
> > become inaccessible while calling pci_pme_wakeup() on it, causing a
> > system fault on some hardware. So a gratuitous pci_pme_wakeup() can be
> > detrimental.
> >
> > We require the device config space to remain accessible, therefore the
> > instantaneous test against D3cold and that the parent bridge is in D0
> > is not sufficient. I see traces where the parent bridge is in D0, but
> > the PM state is RPM_SUSPENDING and the endpoint device transitions to
> > D3cold while we're executing pci_pme_wakeup().
>
> We have pci_config_pm_runtime_{get,put}() helpers to ensure the parent
> of a device is in D0 so that the device's config space is accessible.
> So you may need to use that in pci_pme_wakeup().

pci_config_pm_runtime_get() doesn't seem to align with our current
philosophy to defer polling devices that aren't in the correct power
state. We require the bridge to be in D0, but we defer polling rather
than resume it otherwise. We also defer device polling if the device
is in D3cold, whereas the above function would resume a device in that
state. I think our bridge D0 test could be reliable if it were done
holding a reference acquired via pm_runtime_get_if_active(). Thanks,

Alex