Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
From: Rafael J. Wysocki
Date: Thu Dec 19 2024 - 07:46:30 EST
On Thu, Dec 19, 2024 at 9:02 AM Manivannan Sadhasivam
<manivannan.sadhasivam@xxxxxxxxxx> wrote:
>
> On Tue, Dec 17, 2024 at 08:45:55PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Dec 17, 2024 at 6:26 AM <manivannan.sadhasivam@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
> > >
> > > [...]
> > >
> > > > > There is also a case where some devices like
> > > > > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > > > > powered down in order for the SoC to reach its low power state (CX power
> > > > > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > > > > leading to battery draining quickly. This platform is supported in upstream and
> > > > > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > > > > expecting the device to retain its state during resume. Because of this
> > > > > requirement, this platform is not reaching SoC level low power state with
> > > > > upstream kernel.
> > > >
> > > > OK, now all of this makes sense and that's why you really want NVMe
> > > > devices to end up in some form of PCI D3 in suspend-to-idle.
> > > >
> > > > Would D3hot be sufficient for this platform or does it need to be
> > > > D3cold? If the latter, what exactly is the method by which they are
> > > > put into D3cold?
> > > >
> > >
> > > D3Cold is what preferred. Earlier the controller driver used to transition the
> > > device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
> > > etc... Now we have a new framework called 'pwrctrl' that handles the
> > > clock/regulators supplied to the device. So both controller and pwrctrl drivers
> > > need to work in a tandem to put the device into D3Cold.
> > >
> > > Once the PCIe client driver (NVMe in this case) powers down the device, then
> > > controller/pwrctrl drivers will check the PCIe device state and transition the
> > > device into D3Cold. This is a TODO btw.
> > >
> > > But right now there is no generic D3Cold handling available for DT platforms. I
> > > am hoping to fix that too once this NVMe issue is handled.
> >
> > There's no generic D3cold handling for PCIe devices at all AFAICS. At
> > least, I'm not aware of any standard way to do it. Yes, there are
> > vendor-specific conventions that may even be followed by multiple
> > vendors, but not much beyond that.
> >
>
> Yeah, right. Atleast ACPI has its own way of handling D3Cold and that's what I
> meant.
Well, not so simple.
ACPI has a way to put devices into D3cold, but it is based on power
resources: If all of the ACPI power resources for state D3hot of the
given device are turned off, the device is assumed to have no power
which effectively means D3cold.
Now, this basically works for platform devices, but it doesn't work
particularly well for PCIe devices that get their power from the bus.
This is addressed by using a conventional approach that has not been
codified as a standard so far AFAICS. The convention is to define a
D3hot power resource for the parent PCIe port of the device in
question and turn that power resource off when both the device and its
parent port have been programmed to go into D3hot (via PMCSR). So the
flow is that the device goes into D3hot first. Later, its parent port
goes into D3hot and then the D3hot power resource of the port is
turned off. It usually doesn't really remove power from the port
(especially if this is a root port), but it turns down the PCIe link
from it to the device (in a platform-specific way) which effectively
cuts power from the device.
On the way back, the D3hot power resource of the port is turned on
first and this turns the PCIe link between the port and the device up.
> There is no such option available for DT right now. I was shoping that
> once this NVMe issue is resolved, then I could look into D3Cold for DT
> platforms.
Well, consider the above.
[cut]
> > > >
> > > > I think we're getting to the bottom of the issue.
> > > >
> > > > First off, there really is no requirement to avoid putting PCI devices
> > > > into D3hot or D3cold during suspend-to-idle. On all modern Intel
> > > > platforms, PCIe devices are put into D3(hot or cold) during
> > > > suspend-to-idle and I don't see why this should be any different on
> > > > platforms from any other vendors.
> > > >
> > > > The NVMe driver is an exception and it avoids D3(hot or cold) during
> > > > suspend-to-idle because of problems with some hardware or platforms.
> > > > It might in principle allow devices to go into D3(hot or cold) during
> > > > suspend-to-idle, so long as it knows that this is going to work.
> > > >
> > >
> > > Slight correction here.
> > >
> > > NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
> > > state on its own, not all the time. It has some checks [1] in the suspend path
> > > and if the platform/device passes one of the checks, it will power down the
> > > device.
> >
> > Yes, there is a comment in that code explaining what's going on and
> > that is basically "if key ingredients are missing or the firmware is
> > going to do its thing anyway, don't bother".
> >
> > > DT platforms doesn't pass any of the checks, so the NVMe driver always manages
> > > the power state on its own. Unfortunately, the resultant power saving is not
> > > enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
> > > all the time. This is the first problem we are trying to solve.
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448
> >
> > I see.
> >
> > This cannot be done by the driver itself, though, at least not in
> > general. The PCI layer needs to be involved and, if we are talking
> > about D3cold, the platform firmware needs to be involved either.
> >
>
> Right, but the device driver needs to have some idea about what state PCI core
> is going to choose for the device. I believe that's the purpose of
> pci_choose_state() API. More below...
This is not really straightforward as per the above.
The driver would effectively need to know if the device's parent port
would be put into D3cold and pci_choose_state() is only about the
device itself.
> > As a rule, the PCI layer reaches out to the platform firmware for help
> > as needed and drivers don't take part in this directly.
> >
> > The NVMe driver would need to let the PCI layer take over and set the
> > PCI power state of the device if it wanted to get any deeper than NVMe
> > protocol specific power states.
> >
> > In principle, this can be addressed with some kind of a driver opt-in.
> > That is, the NVMe driver would continue to work the way it does, but
> > instead of completely preventing the PCI layer from taking over, it
> > would opt in (the exact opt-in mechanism is TBD) for D3cold if (a) the
> > platform firmware provides a mechanism to do so and (b) the DT
> > indicates that this mechanism should be used for the given device.
> >
>
> Ok, IIUC you are talking about something like this?
>
> static int nvme_suspend(struct device *dev)
> {
> ...
>
> if (pm_suspend_via_firmware() || !ctrl->npss || ... ||
> pci_choose_state(pdev, PMSG_SUSPEND) == PCI_D3cold)
> return nvme_disable_prepare_reset(ndev, true);
>
> /* continue using protocol specific low power state */
>
> ...
> }
>
> Here, pci_choose_state() should tell the driver if the device should enter
> D3Cold. ACPI already supports this API, now I need to add DT support (which is
> not straightforward, but doable). Since this API is already exported, I think it
> makes perfect sense to use it here (and other drivers for similar usecase).
I didn't really think about using pci_choose_state() here, but it is
not unreasonable IMV. It may in principle be extended to cover the
"port + device combo" approach described above. I'd rather avoid
calling pci_choose_state() twice for the same device during one
transition (once in the driver and then in the PCI layer), but that
could be addressed by rearranging the code.
However, I thought about adding a way for the driver to effectively
say "I'm fine with putting this device into power state X so long as
you have an indication from the platform firmware that this is OK",
where X would be the deepest power state allowed by the driver.
The drawback of this is that the driver would not know whether or not
the device would go into state X upfront, so it would need to do its
own thing in case this doesn't happen. In the NVMe driver case, this
would mean putting the device into a protocol low-power state before
returning from the suspend callback which may not be a bad idea
overall.
BTW, I'm wondering what could go wrong if the
pm_suspend_via_firmware() check were dropped from nvme_suspend()? The
driver would do some extra work on systems with ACPI S3 support, but
would it actually break anything? Does anybody know? May be worth
trying ...
> > > > However, there is an additional concern that putting an NVMe device
> > > > into D3cold every time during system suspend on Android might cause it
> > > > to wear out more quickly.
> > > >
> > >
> > > Right, this is the second problem.
> >
> > Let's set this one somewhat aside for now. We'll get back to it when
> > we have clarity in the above.
> >
>
> Ok. I believe this could be addressed in pci_choose_state() itself if required.
Possibly.
Unrelated notice: I will be mostly offline starting tomorrow through
Jan 6, 2025 and responses may be slow during that time.