RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle

From: Mario.Limonciello
Date: Wed May 08 2019 - 21:40:46 EST


> -----Original Message-----
> From: Christoph Hellwig <hch@xxxxxx>
> Sent: Wednesday, May 8, 2019 2:52 PM
> To: Limonciello, Mario
> Cc: kai.heng.feng@xxxxxxxxxxxxx; kbusch@xxxxxxxxxx; keith.busch@xxxxxxxxx;
> axboe@xxxxxx; hch@xxxxxx; sagi@xxxxxxxxxxx; linux-nvme@xxxxxxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on
> Suspend-to-Idle
>
> On Wed, May 08, 2019 at 07:38:50PM +0000, Mario.Limonciello@xxxxxxxx wrote:
> > The existing routines have an implied assumption that firmware will come
> > swinging with a hammer to control the rails the SSD sits on.
> > With S2I everything needs to come from the driver side and it really is a
> > different paradigm.
>
> And that is why this patch is fundamentally broken.
>
> When using the simple pm ops suspend the pm core expects the device
> to be powered off. If fancy suspend doesn't want that we need to
> communicate what to do to the device in another way, as the whole
> thing is a platform decision. There probably is one (or five) methods
> in dev_pm_ops that do the right thing, but please coordinate this
> with the PM maintainers to make sure it does the right thing and
> doesn't for example break either hibernate where we really don't
> expect just a lower power state, or

You might think the answer is adding runtime_suspend/runtime_resume
callbacks, but those are also called during normal runtime operation, which
is not the goal here. At runtime, these types of disks should rely on APST,
which should calculate the appropriate latencies around the different power
states.

This code path is only applicable in the suspend to idle state, which /does/
call suspend/resume functions associated with dev_pm_ops. There isn't
a dedicated function in there for use only in suspend to idle, which is
why pm_suspend_via_s2idle() needs to get called.

SIMPLE_DEV_PM_OPS normally sets the same function for suspend and
freeze (hibernate), so to avoid any changes to the hibernate case it seems
to me that there needs to be a new nvme_freeze() that calls into the existing
nvme_dev_disable() for the freeze pm op, and a new nvme_thaw() that calls
into the existing nvme_reset_ctrl() for the thaw pm op.
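
To make the split concrete, here is a rough user-space model of what I
mean. It is not kernel code: struct pm_ops_model is a stand-in for a subset
of struct dev_pm_ops, the callback bodies only record which path ran, and
routing poweroff/restore to the freeze/thaw handlers is my own assumption
rather than something settled in this thread.

```c
#include <stddef.h>

struct device;	/* opaque; only passed through in this model */

/* Stand-in for a subset of the callbacks in the kernel's struct dev_pm_ops. */
struct pm_ops_model {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*freeze)(struct device *dev);
	int (*thaw)(struct device *dev);
	int (*poweroff)(struct device *dev);
	int (*restore)(struct device *dev);
};

/* Records which path ran last so the model can be inspected. */
enum pm_action { ACT_NONE, ACT_SUSPEND, ACT_RESUME, ACT_DISABLE, ACT_RESET };
static enum pm_action last_action = ACT_NONE;

/* Suspend path: free to pick a low power state when suspending to idle. */
static int nvme_suspend(struct device *dev)
{
	(void)dev;
	last_action = ACT_SUSPEND;
	return 0;
}

static int nvme_resume(struct device *dev)
{
	(void)dev;
	last_action = ACT_RESUME;
	return 0;
}

/* Hypothetical nvme_freeze(): always the full shutdown path
 * (models a call into the existing nvme_dev_disable). */
static int nvme_freeze(struct device *dev)
{
	(void)dev;
	last_action = ACT_DISABLE;
	return 0;
}

/* Hypothetical nvme_thaw(): always a controller reset
 * (models a call into the existing nvme_reset_ctrl). */
static int nvme_thaw(struct device *dev)
{
	(void)dev;
	last_action = ACT_RESET;
	return 0;
}

/* Spelled out by hand instead of SIMPLE_DEV_PM_OPS, so that freeze and
 * thaw no longer share the suspend and resume handlers. */
static const struct pm_ops_model nvme_dev_pm_ops = {
	.suspend  = nvme_suspend,
	.resume   = nvme_resume,
	.freeze   = nvme_freeze,
	.thaw     = nvme_thaw,
	.poweroff = nvme_freeze,	/* assumption, see above */
	.restore  = nvme_thaw,		/* assumption, see above */
};
```

The point is only that hibernate keeps going through the existing
disable/reset behaviour no matter what the suspend callback decides.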

> enterprise class NVMe devices
> that don't do APST and don't really do different power states at
> all in many cases.

Enterprise class NVMe devices that don't do APST - do they typically
have a non-zero value for ndev->ctrl.npss?

If not, they wouldn't enter this new codepath even if the server entered S2I.
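
As a sketch of the decision I am describing, again as a user-space model
and not the patch itself: pm_suspend_via_s2idle_model() stands in for the
kernel's pm_suspend_via_s2idle(), struct nvme_ctrl_model carries only the
npss field, and the exact form of the condition is my assumption about how
the check would be written.

```c
#include <stdbool.h>

/* Stand-in for the kernel's pm_suspend_via_s2idle(); the flag is set by
 * hand here so that both cases can be exercised. */
static bool model_s2idle_active;

static bool pm_suspend_via_s2idle_model(void)
{
	return model_s2idle_active;
}

/* Only the field that matters for this decision. */
struct nvme_ctrl_model {
	unsigned char npss;	/* from Identify Controller; 0 means one power state */
};

enum suspend_path {
	PATH_FULL_SHUTDOWN,		/* existing behaviour */
	PATH_NON_OPERATIONAL_PS		/* new low power path */
};

/* The new path is taken only when the platform suspends via s2idle and
 * the controller reports more than one power state. */
static enum suspend_path
nvme_choose_suspend_path(const struct nvme_ctrl_model *ctrl)
{
	if (pm_suspend_via_s2idle_model() && ctrl->npss > 0)
		return PATH_NON_OPERATIONAL_PS;
	return PATH_FULL_SHUTDOWN;
}
```

With npss == 0 this returns PATH_FULL_SHUTDOWN regardless of the suspend
flavour, so such a device would keep today's behaviour.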