Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user

From: Manivannan Sadhasivam
Date: Thu Dec 19 2024 - 03:03:50 EST


On Tue, Dec 17, 2024 at 08:45:55PM +0100, Rafael J. Wysocki wrote:
> On Tue, Dec 17, 2024 at 6:26 AM <manivannan.sadhasivam@xxxxxxxxxx> wrote:
> >
> > On Mon, Dec 16, 2024 at 08:34:24PM +0100, Rafael J. Wysocki wrote:
> >
> > [...]
> >
> > > > There is also a case where some devices like
> > > > (Laptops made out of Qcom SCX Gen3 SoCs) require all the PCIe devices to be
> > > > powered down in order for the SoC to reach its low power state (CX power
> > > > collapse in Qcom terms). If not, the SoC would continue to draw more power
> > > > leading to battery draining quickly. This platform is supported in upstream and
> > > > we keep the PCIe interconnect voted during suspend as the NVMe driver is
> > > > expecting the device to retain its state during resume. Because of this
> > > > requirement, this platform is not reaching SoC level low power state with
> > > > upstream kernel.
> > >
> > > OK, now all of this makes sense and that's why you really want NVMe
> > > devices to end up in some form of PCI D3 in suspend-to-idle.
> > >
> > > Would D3hot be sufficient for this platform or does it need to be
> > > D3cold? If the latter, what exactly is the method by which they are
> > > put into D3cold?
> > >
> >
> > D3Cold is what preferred. Earlier the controller driver used to transition the
> > device into D3Cold by sending PME_Turn_Off, turning off refclk, regulators
> > etc... Now we have a new framework called 'pwrctrl' that handles the
> > clock/regulators supplied to the device. So both controller and pwrctrl drivers
> > need to work in a tandem to put the device into D3Cold.
> >
> > Once the PCIe client driver (NVMe in this case) powers down the device, then
> > controller/pwrctrl drivers will check the PCIe device state and transition the
> > device into D3Cold. This is a TODO btw.
> >
> > But right now there is no generic D3Cold handling available for DT platforms. I
> > am hoping to fix that too once this NVMe issue is handled.
>
> There's no generic D3cold handling for PCIe devices at all AFAICS. At
> least, I'm not aware of any standard way to do it. Yes, there are
> vendor-specific conventions that may even be followed by multiple
> vendors, but not much beyond that.
>

Yeah, right. Atleast ACPI has its own way of handling D3Cold and that's what I
meant. There is no such option available for DT right now. I was shoping that
once this NVMe issue is resolved, then I could look into D3Cold for DT
platforms.

> > > > > > > > If the platform is using DT, then there is no entity setting
> > > > > > > > pm_set_suspend_via_firmware().
> > > > > > >
> > > > > > > That's true and so the assumption is that in this case the handling of
> > > > > > > all devices will always be the same regardless of which flavor of
> > > > > > > system suspend is chosen by user space.
> > > > > > >
> > > > > >
> > > > > > Right and that's why the above concern is raised.
> > > > >
> > > > > And it is still unclear to me what the problem with it is.
> > > > >
> > > > > What exactly can go wrong?
> > > > >
> > > > > > > > So NVMe will be kept in low power state all the
> > > > > > > > time (still draining the battery).
> > > > > > >
> > > > > > > So what would be the problem with powering it down unconditionally?
> > > > > > >
> > > > > >
> > > > > > I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> > > > > > will wear out the NVMe devices if used in Android tablets etc...
> > > > >
> > > > > I understand the wear-out concern.
> > > > >
> > > > > Is there anything else?
> > > > >
> > > >
> > > > No, that's the only concern.
> > >
> > > OK
> > >
> > > I think we're getting to the bottom of the issue.
> > >
> > > First off, there really is no requirement to avoid putting PCI devices
> > > into D3hot or D3cold during suspend-to-idle. On all modern Intel
> > > platforms, PCIe devices are put into D3(hot or cold) during
> > > suspend-to-idle and I don't see why this should be any different on
> > > platforms from any other vendors.
> > >
> > > The NVMe driver is an exception and it avoids D3(hot or cold) during
> > > suspend-to-idle because of problems with some hardware or platforms.
> > > It might in principle allow devices to go into D3(hot or cold) during
> > > suspend-to-idle, so long as it knows that this is going to work.
> > >
> >
> > Slight correction here.
> >
> > NVMe driver avoids PCI PM _only_ when it wants to handle the NVMe power
> > state on its own, not all the time. It has some checks [1] in the suspend path
> > and if the platform/device passes one of the checks, it will power down the
> > device.
>
> Yes, there is a comment in that code explaining what's going on and
> that is basically "if key ingredients are missing or the firmware is
> going to do its thing anyway, don't bother".
>
> > DT platforms doesn't pass any of the checks, so the NVMe driver always manages
> > the power state on its own. Unfortunately, the resultant power saving is not
> > enough, so the vendors (Laptop/Automotive) using DT want NVMe to be powered down
> > all the time. This is the first problem we are trying to solve.
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/pci.c#n3448
>
> I see.
>
> This cannot be done by the driver itself, though, at least not in
> general. The PCI layer needs to be involved and, if we are talking
> about D3cold, the platform firmware needs to be involved either.
>

Right, but the device driver needs to have some idea about what state PCI core
is going to choose for the device. I believe that's the purpose of
pci_choose_state() API. More below...

> As a rule, the PCI layer reaches out to the platform firmware for help
> as needed and drivers don't take part in this directly.
>
> The NVMe driver would need to let the PCI layer take over and set the
> PCI power state of the device if it wanted to get any deeper than NVMe
> protocol specific power states.
>
> In principle, this can be addressed with some kind of a driver opt-in.
> That is, the NVMe driver would continue to work the way it does, but
> instead of completely preventing the PCI layer from taking over, it
> would opt in (the exact opt-in mechanism is TBD) for D3cold if (a) the
> platform firmware provides a mechanism to do so and (b) the DT
> indicates that this mechanism should be used for the given device.
>

Ok, IIUC you are talking about something like this?

static int nvme_suspend(struct device *dev)
{
...

if (pm_suspend_via_firmware() || !ctrl->npss || ... ||
pci_choose_state(pdev, PMSG_SUSPEND) == PCI_D3cold)
return nvme_disable_prepare_reset(ndev, true);

/* continue using protocol specific low power state */

...
}

Here, pci_choose_state() should tell the driver if the device should enter
D3Cold. ACPI already supports this API, now I need to add DT support (which is
not straightforward, but doable). Since this API is already exported, I think it
makes perfect sense to use it here (and other drivers for similar usecase).

> > > However, there is an additional concern that putting an NVMe device
> > > into D3cold every time during system suspend on Android might cause it
> > > to wear out more quickly.
> > >
> >
> > Right, this is the second problem.
>
> Let's set this one somewhat aside for now. We'll get back to it when
> we have clarity in the above.
>

Ok. I believe this could be addressed in pci_choose_state() itself if required.

> > > Is there anything else?
> >
> > We also need to consider the fact that the firmware in some platforms doesn't
> > support S2R. So we cannot take a decision based on S2I/S2R alone.
>
> Right, the S2I/S2R thing is ACPI-specific.
>
> On platforms using ACPI, pm_suspend_via_firmware() means that the
> firmware is at least likely to power down the whole platform, so the
> PCI layer may as well be allowed to do what it wants with the device.
>
> > I think there are atleast a couple of ways to solve above mentioned problems:
> >
> > 1. Go extra mile, take account of all issues/limitations mentioned above and
> > come up with an API. This API will provide a sane default suspend behavior to
> > drivers (fixing first problem) and will also allow changing the behavior
> > dynamically (to fix the second problem where kernel cannot distinguish Android
> > vs other OS).
>
> I don't quite follow TBH.
>
> A "default suspend behavior" is there already and it is to allow the
> PCI layer to set the device's power state (in collaboration with the
> platform firmware). Thing is, the NVMe driver doesn't always want to
> rely on it.
>
> > 2. Allow DT platforms to call pm_set_suspend_via_firmware() and make use of the
> > existing behavior in the NVMe driver. This would only solve the first problem
> > though. But would be a good start.
>
> That would mean just letting the PCI layer always take over on the
> platforms that would call pm_set_suspend_via_firmware().
>
> There is a potential issue with doing it, which is that everybody
> calling pm_suspend_via_firmware() would then assume that the platform
> would be powered down by the firmware. I'm not sure how much of an
> issue that would be in practice, though.

Yeah, that would be a concern.

- Mani

--
மணிவண்ணன் சதாசிவம்