Re: [PATCH] nvme-pci: Shutdown the device if D3Cold is allowed by the user
From: Manivannan Sadhasivam
Date: Mon Dec 09 2024 - 10:01:10 EST
On Thu, Dec 05, 2024 at 05:29:00PM -0600, Bjorn Helgaas wrote:
> On Sat, Nov 23, 2024 at 02:31:13PM +0530, Manivannan Sadhasivam wrote:
> > On Fri, Nov 22, 2024 at 04:20:50PM -0600, Bjorn Helgaas wrote:
> > > On Mon, Nov 18, 2024 at 01:53:44PM +0530, Manivannan Sadhasivam wrote:
> > > > PCI core allows users to configure the D3Cold state for each PCI
> > > > device through the sysfs attribute
> > > > '/sys/bus/pci/devices/.../d3cold_allowed'. This attribute sets
> > > > the 'pci_dev:d3cold_allowed' flag and could be used by users to
> > > > allow/disallow the PCI devices to enter D3Cold during system
> > > > suspend.
> > > >
> > > > So make use of this flag in the NVMe driver to shutdown the NVMe
> > > > device during system suspend if the user has allowed D3Cold for
> > > > the device. Existing checks in the NVMe driver decide whether
> > > > to shut down the device (based on platform/device limitations),
> > > > so use this flag as the last resort to keep the existing
> > > > behavior.
> > > >
> > > > The default behavior of the 'pci_dev:d3cold_allowed' flag is to
> > > > allow D3Cold and the users can disallow it through sysfs if they
> > > > want.
> > >
> > > What problem does this solve? I guess there must be a case where
> > > suspend leaves NVMe in a higher power state than you want?
> >
> > Yeah, this is the case for all systems that doesn't fit into the
> > existing checks in the NVMe suspend path:
> >
> > 1. ACPI based platforms
> > 2. Controller doesn't support NPSS (hardware issue/limitation)
> > 3. ASPM not enabled
> > 4. Devices/systems setting NVME_QUIRK_SIMPLE_SUSPEND flag
> >
> > In my case, all the Qualcomm SoCs using Devicetree doesn't fall into
> > the above checks. Hence, they were not fully powered down during
> > system suspend and always in low power state. This means, I cannot
> > achieve 'CX power collapse', a Qualcomm specific SoC powered down
> > state that consumes just enough power to wake up the SoC. Since the
> > controller driver keeps the PCI resource vote because of NVMe, the
> > firmware in the Qualcomm SoCs cannot put the SoC into above
> > mentioned low power state.
>
> IIUC nvme_suspend() has two paths:
>
> - Do nvme_disable_prepare_reset() without calling pci_save_state(),
> so the PCI core chooses and sets the low-power state.
>
> - Put the device in an NVMe-specific low-power state and call
> pci_save_state(), which prevents the PCI core from putting the
> device in a low-power state.
>
> (The PCI core part is in pci_pm_suspend_noirq(),
> pci_pm_poweroff_noirq(), pci_pm_runtime_suspend())
>
> And I guess you want the first path for basically all systems? The
> only systems that would use the NVMe-specific path are those where:
>
> - !pm_suspend_via_firmware() (not an ACPI system), AND
>
> - ctrl->npss (device supports NVMe power states), AND
>
> - pcie_aspm_enabled(), AND
>
> - !NVME_QUIRK_SIMPLE_SUSPEND (it's not something with a weird
> quirk), AND
>
> - !pdev->d3cold_allowed (user has cleared it via sysfs)
>
> This frankly seems almost unintelligible to me, so I'm glad I'm not
> responsible for nvme :)
>
I agree that using the attribute is not a great idea in the NVMe driver where
there are already a handful of quirks/checks, but that seems unavoidable.
> > > I'm not sure the use of pdev->d3cold_allowed here really expresses
> > > your underlying intent. It suggests that you're really hoping for
> > > D3cold, but that's only a possibility if firmware supports it, and
> > > we have no visibility into that here.
> >
> > I'm not relying on firmware to do anything here. If firmware has to
> > decide the suspend state, it should already satisfy the
> > pm_suspend_via_firmware() check in nvme_suspend(). ...
>
> I'm confused about this because we want to use PCI core power
> management, which chooses the new state with pci_target_state(),
> which looks like it will choose D3hot unless we're on an ACPI system
> and acpi_pci_choose_state() returns D3cold.
>
> But your system is not an ACPI system, so we should get D3hot, but yet
> you decide based on D3*cold* being allowed?
>
This is an existing ungliness of DT platforms. D3Cold is not tied to the PCI
core, but the controller drivers decide on their own.
> > Here, the controller driver takes care of putting the device into
> > D3Cold. Currently, the controller drivers cannot do it (on DT
> > platforms) because of NVMe driver's behavior.
>
> I'm missing the connection to the controller driver (I assume you mean
> qcom-pcie?). Maybe it's that having the NVMe device in a PCI
> low-power state allows qcom_pcie_suspend_noirq() to reduce the ICC
> bandwidth vote and do other power-saving things? And it can't do
> those things if we're using the NVMe low-power state because that
> state is not visible to qcom-pcie?
>
Yes. In DT platforms, peripheral power state is not controlled by the firmware
to some extent. For PCIe, the controller driver is responsible for handling the
endpoint devices power state. As you referred qcom-pcie driver, it currently has
the 1 Kbps ICC vote in qcom_pcie_suspend_noirq() just because we cannot fully
remove the ICC vote due to NVMe. Because of this, even if NVMe is in its optimal
low power state, the SoC cannot enter its own low power state. Plus it doesn't
make a lot of sense to keep NVMe in low power mode even if you suspend your DT
based laptop for hours.
- Mani
--
மணிவண்ணன் சதாசிவம்