Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user

From: Rafael J. Wysocki
Date: Mon Dec 16 2024 - 12:36:32 EST


On Mon, Dec 16, 2024 at 6:11 PM Manivannan Sadhasivam
<manivannan.sadhasivam@xxxxxxxxxx> wrote:
>
> On Mon, Dec 16, 2024 at 05:24:40PM +0100, Rafael J. Wysocki wrote:
> > On Sat, Dec 14, 2024 at 7:30 AM Manivannan Sadhasivam
> > <manivannan.sadhasivam@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Dec 13, 2024 at 03:35:15PM +0100, Rafael J. Wysocki wrote:
> > > > On Thu, Dec 12, 2024 at 4:14 PM Christoph Hellwig <hch@xxxxxx> wrote:
> > > > >
> > > > > On Thu, Dec 12, 2024 at 01:49:15PM +0100, Ulf Hansson wrote:
> > > > > > Right. This seems to somewhat work for ACPI types of systems, because
> > > > > > ACPI is controlling the low power state for all the devices. Based on
> > > > > > the requested system wide low power state, ACPI can then decide to
> > > > > > call pm_set_suspend_via_firmware() or not.
> > > > > >
> > > > > > Still there is a problem with this for ACPI too.
> > > > > >
> > > > > > How does ACPI know whether it's actually a good idea to keep the NVMe
> > > > > > storage powered in s2idle (ACPI calls pm_set_suspend_via_firmware()
> > > > > > only for S2R and S2disk!?)? Especially when my laptop only supports
> > > > > > s2idle and that's what I will use when I close the lid. In this way,
> > > > > > the NMVe storage will certainly contribute to draining the battery,
> > > > > > especially when I won't be using my laptop for a couple of days.
> > > > > >
> > > > > > In my opinion, we need a better approach that is both flexible and
> > > > > > that dynamically adjusts based upon the use case.
> > > > >
> > > > > Agreed. I'd be happy to work with the PM maintainers to do this,
> > > > > but I don't really know enough about the PM core to drive it
> > > > > (as the reply from Rafael to my mail makes pretty clear :))
> > > >
> > > > I'm here to help.
> > > >
> > > > Let me know what exactly you want to achieve and we'll see how to make it work.
> > >
> > > I'll try to summarize the requirement here since I started this thread:
> > >
> > > Problem statement
> > > =================
> > >
> > > We need a PM core API that tells the device drivers when it is safe to powerdown
> > > the devices. The usecase here is with PCIe based NVMe devices but the problem is
> > > applicable to other devices as well.
> > >
> > > Drivers are relying on couple of options now:
> > >
> > > 1. If pm_suspend_via_firmware() returns true, then drivers will shutdown the
> > > device assuming that the firmware is going to handle the suspend. But this API
> > > is currently used only by ACPI. Even there, ACPI relies on S2R being supported
> > > by the platform and it sets pm_set_suspend_via_firmware() only when the suspend
> > > is S2R. But if the platform doesn't support S2R (current case of most of the
> > > Qcom SoCs), then pm_suspend_via_firmware() will return false and NVMe won't be
> > > powered down draining the battery.
> >
> > So my question here would be why is it not powered down always during
> > system-wide suspend?
> >
> > Why exactly is it necessary to distinguish one case from the other
> > (assuming that we are talking about system-wide suspend only)?
> >
>
> To support Android like usecase with firmware that only supports
> suspend-to-idle (Qcom platforms). This usecase is not applicable right now, but
> one can't just rule out the possibility in the near future.

This doesn't explain anything to me, sorry.

> And the problem is that we need to support both Android and non-Android systems
> with same firmware :/

So what technically is the problem?

> > There are drivers that use pm_suspend_via_firmware() to check whether
> > or not something special needs to be done to the device because if
> > "false" is returned, the platform firmware is not going to remove
> > power from it.
> >
> > However, you seem to be talking about the opposite, so doing something
> > special to the device if "true" is returned. I'm not sure why this is
> > necessary.
> >
>
> Because, since 'false' is returned, drivers like NVMe are assuming that the
> platform won't remove power on all DT systems and they just keep the devices in
> low power state (not powering them down). But why would I want my NVMe in DT
> based laptop to be always powered in system suspend?

Because it causes the system to use less energy when suspended.

> > > If the platform is using DT, then there is no entity setting
> > > pm_set_suspend_via_firmware().
> >
> > That's true and so the assumption is that in this case the handling of
> > all devices will always be the same regardless of which flavor of
> > system suspend is chosen by user space.
> >
>
> Right and that's why the above concern is raised.

And it is still unclear to me what the problem with it is.

What exactly can go wrong?

> > > So NVMe will be kept in low power state all the
> > > time (still draining the battery).
> >
> > So what would be the problem with powering it down unconditionally?
> >
>
> I'm not sure how would you do that (by checking dev_of_node()?). Even so, it
> will wear out the NVMe devices if used in Android tablets etc...

I understand the wear-out concern.

Is there anything else?

> > > There were attempts to set this flag from
> > > PSCI [1], but there were objections on setting this flag when PSCI_SUSPEND is
> > > not supported by the platform (again, the case with Qcom SoCs). Even if this
> > > approach succeeds, then there are concerns that if the platform is used in an
> > > OS like Android where the S2Idle cycle is far more high, NVMe will wear out
> > > very quickly.
> >
> > I see.
> >
> > > So this is where the forthcoming API need to "dynamically adjusts
> > > based upon the use case" as quoted by Ulf in his previous reply. One way to
> > > achieve would be by giving the flexibility to the userspace to choose the
> > > suspend state (if platform has options to select). UFS does something similar
> > > with 'spm_lvl' [2] sysfs attribute that I believe Android userspace itself makes
> > > use of.
> >
> > Before we're talking about APIs, let's talk about the desired behavior.
> >
> > It looks like there are cases in which you'd want to turn the device
> > off completely (say put it into D3cold in the PCI terminology) and
> > there are cases in which you'd want it to stay in a somewhat-powered
> > low-power state.
> >
> > It is unclear to me what they are at this point.
> >
>
> I hope that my above explanation clarifies.

Sorry, but not really.

> Here is the short version of the suspend requirement across platforms:
>
> 1. D3Cold (power down) - Laptops/Automotive
> 2. D3hot (low power) - Android Tablets

Where do the above requirements come from?

> FWIW, I did receive feedback from people asking to just ignore the Android
> usecase and always power down the devices for DT platforms. But I happen to
> disagree with them. Let me know if I was wrong and I should not worry about
> Android usecase as it is for the future.

I'm not sure what you mean by the "Android usecase" TBH. Do you mean
the wear-out concern in the Android usage scenario or is there more to
it?