Re: [PATCH 0/3] Allow specifying an S2RAM sleep on pre-SYSTEM_SUSPEND PSCI impls
From: Ulf Hansson
Date: Fri Dec 06 2024 - 04:54:41 EST
+ Maulik, Vincent
On Thu, 5 Dec 2024 at 21:34, Konrad Dybcio
<konrad.dybcio@xxxxxxxxxxxxxxxx> wrote:
>
> On 14.11.2024 4:30 PM, Ulf Hansson wrote:
> > On Mon, 28 Oct 2024 at 15:24, Konrad Dybcio <konradybcio@xxxxxxxxxx> wrote:
> >>
> >> Certain firmwares expose exactly what PSCI_SYSTEM_SUSPEND does through
> >> CPU_SUSPEND instead. Inform Linux about that.
> >> Please see the commit messages for a more detailed explanation.
> >>
> >> This is effectively a more educated follow-up to [1].
> >>
> >> The ultimate goal is to stop making Linux think that certain states
> >> only concern cores/clusters, and consequently setting
> >> pm_set_suspend/resume_via_firmware(), so that client drivers (such as
> >> NVMe, see related discussion over at [2]) can make informed decisions
> >> about assuming the power state of the device they govern.
> >
> > In my opinion, this is not really the correct way to do it. Using
> > pm_set_suspend/resume_via_firmware() works fine for x86/ACPI, but not
> > for PSCI like this. Let me elaborate. If the NVMe storage device is
> > sharing the same power-rail as the CPU cluster, then yes we should use
> > PSCI to control it. But is that really the case? If so, there are in
> > principle two ways forward to deal with this correctly.
> >
> > 1) If PSCI OSI mode is being used, the corresponding NVMe storage
> > device should be hooked up to the CPU PM cluster domain via genpd and
> > controlled as any other devices sharing the cluster-rail. In this way,
> > genpd together with the cpuidle-psci-domain can decide whether it's
> > okay to turn off the cluster. I believe this is the preferred way, but
> > 2) would work fine too.
> >
> > 2) If PSCI PC mode is being used, a separate channel/interface to the
> > FW (like SCMI or rpmh in the QC case), should inform the FW whether
> > NVMe needs the power to it. This information should then be taken into
> > account by the PSCI FW when it decides what low-power-state to enter,
> > which ultimately means whether the cluster-rail can be turned off or
> > not.
>
> This assumes PSCI only governs the CPU power rail. But what I'd
> guesstimate is that in most implementations if system-level suspend is
> there at all (no matter through which call), as per the spec, it at
> least also projects onto the DDR power state (like in this i.mx
> impl here [1]), or some uncore peripherals (like in Tegra's case with
> some secure element being toggled at [2])
Right, I certainly understand the above. There are different parts of
an SoC that may be sharing the same power-island as the CPUs.
The question here is whether the NVMe storage device is part of that
power-island too on some QC SoCs?
>
> > Assuming PSCI OSI mode is used here. Then if 1) doesn't work for you,
> > please elaborate on why, so we can help to make it work, as it should.
>
> On Qualcomm platforms, RPMh is the central authority when it comes
> to power governance, but by design, the CPUs must be off (and with a
> specific magic cookie) for the RPMh hardware to consider powering off
> very power hungry parts of the system, such as general i/o rails.
Right, that is why the "qcom,rpmh-rsc" device in many cases belongs to
the cluster-power-domain (for PSCI). This allows "qcom,rpmh-rsc" to
control the "last-man" activities and prevent deeper PSCI states
if/when necessary.
>
> So again, PSCI must be fed a specific value for the rest of the hw
> to react. The "S2RAM state" isn't really a cpuidle state, because
> it doesn't differ from many shallower states as far as the cpu/cluster
> are concerned. If that all isn't in place, the platform never actually
> enters any "real" sleep state, other than "CPU and some controllable
> IP blocks are runtime-suspended".
We recently discussed this, offlist, with Maulik - and I think we need
some more clarity around what is actually going on here.
In principle, it looks to me that using S2I with just another deeper
idlestate specified (with another psci-suspend-parameter, representing
a deeper state) should work fine, at least theoretically. Of course,
we may not be able to use that idlestate during regular
cpuidle/runtime but only during S2I, which we need to control in a
smooth way and that is not currently supported (but can be fixed
easily, I think).
In the end, it's the psci-suspend-parameter that is given to the PSCI
FW that informs about what state we can enter.
That said, using S2I may not work without updating the PSCI FW, of
course. For example, there may be FW limitations that require the
boot-CPU( CPU0) to be the last one for these deeper low-power-states.
Whether that is just a FW limitation or whether there are some
additional HW constraints that enforce this, needs to be clarified.
>
> This effectively is very close to what ACPI+x86 do - there's a
> co-processor/firmware that does a lot of things behind your back and
> all you can do is *ask* it to change some handwavily-defined P/Cstate
> that affects a huge chunk of silicon.
Yep, there are similarities.
However, ACPI is for generic device power management. PSCI requires
something additional, such as ARM SCMI or QC's rpm/rsc interface.
>
> Konrad
>
> [1] https://github.com/nxp-imx/imx-atf/blob/lf_v2.6/plat/imx/imx8m/imx8mp/imx8mp_lpa_psci.c#L474
> [2] https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/nvidia/tegra/soc/t210/plat_psci_handlers.c#L214
>
Kind regards
Uffe