Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user

From: manivannan . sadhasivam
Date: Tue Dec 17 2024 - 00:30:05 EST


On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:

[...]

> > There is also a case where some devices like
> > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > powered down in order for the SoC to reach its low power state (CX power
> > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > leading to battery draining quickly. This platform is supported in upstream and
> > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > expecting the device to retain its state during resume. Because of this
> > requirement, this platform is not reaching SoC level low power state with
> > upstream kernel.
>
> OK, now all of this makes sense and that's why you really want NVMe
> devices to end up in some form of PCI D3 in suspend-to-idle.
>
> Would D3hot be sufficient for this platform or does it need to be
> D3cold? If the latter, what exactly is the method by which they are
> put into D3cold?
>

D3Cold is what preferred. Earlier the controller driver used to transition the
device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
etc... Now we have a new framework called 'pwrctrl' that handles the
clock/regulators supplied to the device. So both controller and pwrctrl drivers
need to work in a tandem to put the device into D3Cold.

Once the PCIe client driver (NVMe in this case) powers down the device, then
controller/pwrctrl drivers will check the PCIe device state and transition the
device into D3Cold. This is a TODO btw.

But right now there is no generic D3Cold handling available for DT platforms. I
am hoping to fix that too once this NVMe issue is handled.

> > > > > > If the platform is using DT, then there is no entity setting
> > > > > > pm_set_suspend_via_firmware().
> > > > >
> > > > > That's true and so the assumption is that in this case the handling of
> > > > > all devices will always be the same regardless of which flavor of
> > > > > system suspend is chosen by user space.
> > > > >
> > > >
> > > > Right and that's why the above concern is raised.
> > >
> > > And it is still unclear to me what the problem with it is.
> > >
> > > What exactly can go wrong?
> > >
> > > > > > So NVMe will be kept in low power state all the
> > > > > > time (still draining the battery).
> > > > >
> > > > > So what would be the problem with powering it down unconditionally?
> > > > >
> > > >
> > > > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> > > > will wear out the NVMe devices if used in Android tablets etc...
> > >
> > > I understand the wear-out concern.
> > >
> > > Is there anything else?
> > >
> >
> > No, that's the only concern.
>
> OK
>
> I think we're getting to the bottom of the issue.
>
> First off, there really is no requirement to avoid putting PCI devices
> into D3hot or D3cold during suspend-to-idle. On all modern Intel
> platforms, PCIe devices are put into D3(hot or cold) during
> suspend-to-idle and I don't see why this should be any different on
> platforms from any other vendors.
>
> The NVMe driver is an exception and it avoids D3(hot or cold) during
> suspend-to-idle because of problems with some hardware or platforms.
> It might in principle allow devices to go into D3(hot or cold) during
> suspend-to-idle, so long as it knows that this is going to work.
>

Slight correction here.

NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
state on its own, not all the time. It has some checks [1] in the suspend path
and if the platform/device passes one of the checks, it will power down the
device.

DT platforms doesn't pass any of the checks, so the NVMe driver always manages
the power state on its own. Unfortunately, the resultant power saving is not
enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
all the time. This is the first problem we are trying to solve.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448

> However, there is an additional concern that putting an NVMe device
> into D3cold every time during system suspend on Android might cause it
> to wear out more quickly.
>

Right, this is the second problem.

> Is there anything else?

We also need to consider the fact that the firmware in some platforms doesn't
support S2R. So we cannot take a decision based on S2I/S2R alone.

I think there are atleast a couple of ways to solve above mentioned problems:

1. Go extra mile, take account of all issues/limitations mentioned above and
come up with an API. This API will provide a sane default suspend behavior to
drivers (fixing first problem) and will also allow changing the behavior
dynamically (to fix the second problem where kernel cannot distinguish Android
vs other OS).

2. Allow DT platforms to call pm_set_suspend_via_firmware() and make use of the
existing behavior in the NVMe driver. This would only solve the first problem
though. But would be a good start.

- Mani

--
மணிவண்ணன் சதாசிவம்