Re: [PATCH] drm/xe/d3cold: Set power state to D3Cold during s2idle/s3

From: Rafael J. Wysocki
Date: Thu Apr 03 2025 - 07:12:30 EST


On Thu, Apr 3, 2025 at 5:12 AM Raag Jadav <raag.jadav@xxxxxxxxx> wrote:
>
> Cc: PCI maintainers
>
> On Wed, Apr 02, 2025 at 12:31:48PM +0200, Rafael J. Wysocki wrote:
> > On Wed, Apr 2, 2025 at 10:34 AM Raag Jadav <raag.jadav@xxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 02, 2025 at 10:31:16AM +0300, Raag Jadav wrote:
> > > > On Tue, Apr 01, 2025 at 09:35:49PM +0200, Rafael J. Wysocki wrote:
> > > > > On Tue, Apr 1, 2025 at 7:53 PM Raag Jadav <raag.jadav@xxxxxxxxx> wrote:
> > > >
> > > > ...
> > > >
> > > > > > That's not what I meant here. There are multiple issues at play.
> > > > > >
> > > > > > 1. An AER is reported[*] on root port during system suspend even before we
> > > > > > reach any of the driver PM callback. From initial investigation it seems
> > > > > > like a case of misbehaviour by some child device, but this is a different
> > > > > > issue entirely.
> > > > > >
> > > > > > [*] irq/120-aerdrv-145 [002] ..... 1264.981023: <stack trace>
> > > > > > => xe_pm_runtime_resume
> > > > > > => xe_pci_runtime_resume
> > > > > > => pci_pm_runtime_resume
> > > > > > => __rpm_callback
> > > > > > => rpm_callback
> > > > > > => rpm_resume
> > > > > > => __pm_runtime_resume
> > > > > > => pci_pm_runtime_get_sync
> > > > > > => __pci_walk_bus
> > > > > > => pci_walk_bus
> > > > > > => pcie_do_recovery
> > > > > > => aer_process_err_devices
> > > > > > => aer_isr
> > > > > >
> > > > > > 2. Setting explicit pci_set_power_state(pdev, PCI_D3cold) from xe_pm_suspend().
> > > > > > Although we see many drivers do it for their case, it's quite a questionable
> > > > > > choice (atleast IMHO) to hard suspend the device from driver PM callback
> > > > > > without any regard to runtime_usage counter. It can hide potential issues
> > > > > > like AER during system suspend (regardless of whether or not it is supported
> > > > > > by the driver, since it is supposed to keep the device active on such a
> > > > > > catastrophic failure anyway), but I'll leave it to the experts to decide.
> > > > >
> > > > > If the driver does not set DPM_FLAG_SMART_SUSPEND, and xe doesn't set
> > > > > it AFAICS (at least not in the mainline), pci_pm_suspend() will resume
> > > > > the device from runtime suspend before invoking its driver's callback.
> > > > >
> > > > > This guarantees that the device is always RPM_ACTIVE when
> > > > > xe_pci_suspend() runs and it cannot be runtime-suspended because the
> > > > > core is holding a runtime PM reference on it (and so the runtime PM
> > > > > usage counter is always greater than zero when xe_pci_suspend() runs).
> > > > >
> > > > > This means that neither xe_pci_runtime_suspend() nor
> > > > > xe_pci_runtime_resume() can run concurrently with respect to it, so
> > > > > xe_pci_suspend() can change the power state of the device etc. safely.
> > > >
> > > > Ah, I failed to notice that __pm_runtime_resume() is taking a spin_lock_irqsave().
> > > > Thanks for clearing this up.
> > >
> > > On second thought, pcie_do_recovery() can still race with xe_pci_suspend(),
> > > is this understanding correct?
> >
> > Yes, it can, but this is an AER issue.
> >
> > Apparently, somebody took runtime PM into account, but they failed to
> > take system suspend into account.
> >
> > There are many drivers that do PCI PM in their ->suspend() callbacks
> > and this predates pcie_do_recovery() AFAICS. Some of them don't even
> > support runtime PM.
> >
> > > I'm assuming this is why it's much safer to do pci_save_state() and
> > > pci_prepare_to_sleep() only in ->noirq() callbacks like originally done
> > > by PCI PM, right?
> >
> > Not really, but close.
> >
> > The noirq phases were introduced because drivers often failed to
> > prevent their interrupt handlers from messing up with devices after
> > powering them down.
> >
> > Recovery is kind of like hot-add, doing any of them during system
> > suspend is a bad idea because the outcome is hard to predict.
> >
> > AER needs to be fixed.
>
> I agree it's a bad idea and should be fixed but even more unpredictable
> in such cases is resuming the device afterwards, which may or may not
> succeed depending on the fault that has happened. So perhaps just not
> let the device suspend to D3 at all?

Yes, it is better to leave the device in question in its current power
state in that case.