Re: [PATCH] drm/xe/d3cold: Set power state to D3Cold during s2idle/s3
From: Raag Jadav
Date: Wed Apr 02 2025 - 23:12:59 EST
Cc: PCI maintainers
On Wed, Apr 02, 2025 at 12:31:48PM +0200, Rafael J. Wysocki wrote:
> On Wed, Apr 2, 2025 at 10:34 AM Raag Jadav <raag.jadav@xxxxxxxxx> wrote:
> >
> > On Wed, Apr 02, 2025 at 10:31:16AM +0300, Raag Jadav wrote:
> > > On Tue, Apr 01, 2025 at 09:35:49PM +0200, Rafael J. Wysocki wrote:
> > > > On Tue, Apr 1, 2025 at 7:53 PM Raag Jadav <raag.jadav@xxxxxxxxx> wrote:
> > >
> > > ...
> > >
> > > > > That's not what I meant here. There are multiple issues at play.
> > > > >
> > > > > 1. An AER is reported[*] on root port during system suspend even before we
> > > > > reach any of the driver PM callback. From initial investigation it seems
> > > > > like a case of misbehaviour by some child device, but this is a different
> > > > > issue entirely.
> > > > >
> > > > > [*] irq/120-aerdrv-145 [002] ..... 1264.981023: <stack trace>
> > > > > => xe_pm_runtime_resume
> > > > > => xe_pci_runtime_resume
> > > > > => pci_pm_runtime_resume
> > > > > => __rpm_callback
> > > > > => rpm_callback
> > > > > => rpm_resume
> > > > > => __pm_runtime_resume
> > > > > => pci_pm_runtime_get_sync
> > > > > => __pci_walk_bus
> > > > > => pci_walk_bus
> > > > > => pcie_do_recovery
> > > > > => aer_process_err_devices
> > > > > => aer_isr
> > > > >
> > > > > 2. Setting explicit pci_set_power_state(pdev, PCI_D3cold) from xe_pm_suspend().
> > > > > Although we see many drivers do it for their case, it's quite a questionable
> > > > > choice (atleast IMHO) to hard suspend the device from driver PM callback
> > > > > without any regard to runtime_usage counter. It can hide potential issues
> > > > > like AER during system suspend (regardless of whether or not it is supported
> > > > > by the driver, since it is supposed to keep the device active on such a
> > > > > catastrophic failure anyway), but I'll leave it to the experts to decide.
> > > >
> > > > If the driver does not set DPM_FLAG_SMART_SUSPEND, and xe doesn't set
> > > > it AFAICS (at least not in the mainline), pci_pm_suspend() will resume
> > > > the device from runtime suspend before invoking its driver's callback.
> > > >
> > > > This guarantees that the device is always RPM_ACTIVE when
> > > > xe_pci_suspend() runs and it cannot be runtime-suspended because the
> > > > core is holding a runtime PM reference on it (and so the runtime PM
> > > > usage counter is always greater than zero when xe_pci_suspend() runs).
> > > >
> > > > This means that neither xe_pci_runtime_suspend() nor
> > > > xe_pci_runtime_resume() can run concurrently with respect to it, so
> > > > xe_pci_suspend() can change the power state of the device etc. safely.
> > >
> > > Ah, I failed to notice that __pm_runtime_resume() is taking a spin_lock_irqsave().
> > > Thanks for clearing this up.
> >
> > On second thought, pcie_do_recovery() can still race with xe_pci_suspend(),
> > is this understanding correct?
>
> Yes, it can, but this is an AER issue.
>
> Apparently, somebody took runtime PM into account, but they failed to
> take system suspend into account.
>
> There are many drivers that do PCI PM in their ->suspend() callbacks
> and this predates pcie_do_recovery() AFAICS. Some of them don't even
> support runtime PM.
>
> > I'm assuming this is why it's much safer to do pci_save_state() and
> > pci_prepare_to_sleep() only in ->noirq() callbacks like originally done
> > by PCI PM, right?
>
> Not really, but close.
>
> The noirq phases were introduced because drivers often failed to
> prevent their interrupt handlers from messing up with devices after
> powering them down.
>
> Recovery is kind of like hot-add, doing any of them during system
> suspend is a bad idea because the outcome is hard to predict.
>
> AER needs to be fixed.
I agree it's a bad idea and should be fixed but even more unpredictable
in such cases is resuming the device afterwards, which may or may not
succeed depending on the fault that has happened. So perhaps just not
let the device suspend to D3 at all?
Raag