Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle

From: Christoph Hellwig
Date: Fri May 10 2019 - 01:32:29 EST


> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
> +{
> + int ret;
> +
> + mutex_lock(&ctrl->scan_lock);
> + nvme_start_freeze(ctrl);
> + nvme_wait_freeze(ctrl);
> + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
> + NULL);
> + nvme_unfreeze(ctrl);
> + mutex_unlock(&ctrl->scan_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_power);

I think we should have this in the PCIe driver, given that, while in
theory power states are generic, in practice they are only applicable
to PCIe. To be revisited if history proves me wrong.

Also I don't see any reason why we'd need to do the freeze game on
resume. Even on suspend it looks a little odd to me, as in theory
the PM core should have already put the system into a quiescent state.
But maybe we actually need it there, in which case a comment would
be helpful.

> + if (!pm_suspend_via_firmware())

pm_suspend_via_firmware is a weird name and has absolutely zero
documentation. So I think we really need a big fat comment with the
wisdom from this thread here.

> + return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);

I think we need to skip this code path if NPSS is zero, as that
indicates that the device doesn't actually do power states, and fall
back to the full teardown.
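
Roughly what I have in mind, as an untested sketch. The helper
use_power_state_suspend() is made up for illustration; its two
arguments stand in for ctrl->npss and the result of
pm_suspend_via_firmware():

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not existing kernel code.  Returns true if suspend
 * should only set a low power state, false if it should fall back to the
 * full teardown.
 */
static bool use_power_state_suspend(unsigned int npss, bool via_firmware)
{
	/* Suspend goes through firmware: keep the existing full teardown. */
	if (via_firmware)
		return false;

	/* NPSS == 0: the device only has power state 0, nothing to select. */
	if (npss == 0)
		return false;

	return true;
}
```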

Also I can't find anything in the spec that requires the power states
to be ordered in any particular way, except for this odd sentence:

"Hosts that do not dynamically manage power should set the power
state to the lowest numbered state that satisfies the PCI Express
slot power limit control value."

So we probably have to read through the table at probing time and find
the lowest power state there.
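
A minimal sketch of such a scan, untested and self-contained rather
than written against the driver. The struct and names below are
hypothetical stand-ins for the power state descriptors in the Identify
Controller data. It assumes the max power field is already in CPU byte
order, that flag bit 0 selects 0.0001 W units instead of 0.01 W units,
and that a max power of zero means "not reported":

```c
#include <stdint.h>

/* Assumed meaning: set = 0.0001 W units, clear = 0.01 W units. */
#define SKETCH_PS_MAX_POWER_SCALE	(1 << 0)

/* Hypothetical, simplified power state descriptor. */
struct sketch_ps_desc {
	uint16_t max_power;	/* already converted to CPU byte order */
	uint8_t flags;
};

/* Normalize the max power of one state to units of 0.0001 W. */
static uint32_t sketch_ps_power(const struct sketch_ps_desc *ps)
{
	if (ps->flags & SKETCH_PS_MAX_POWER_SCALE)
		return ps->max_power;
	return (uint32_t)ps->max_power * 100;
}

/*
 * Return the index of the state with the lowest reported max power, or -1
 * if no state reports one.  npss is zero-based, so there are npss + 1
 * entries.  No ordering of the table is assumed.
 */
static int sketch_lowest_power_state(const struct sketch_ps_desc *psd,
				     unsigned int npss)
{
	uint32_t best_power = UINT32_MAX;
	int best = -1;
	unsigned int i;

	for (i = 0; i <= npss; i++) {
		uint32_t power = sketch_ps_power(&psd[i]);

		if (!power)	/* assumed: zero means not reported */
			continue;
		if (power < best_power) {
			best_power = power;
			best = i;
		}
	}

	return best;
}
```

Whether the scan should also be restricted to non-operational states
is a separate question that this sketch leaves open.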

Rafael also brought up the issue of not entering D3, and the solution
for it, which is somewhat non-intuitive to me, so I'm not commenting on
that except for asking for a comment on that save_state call.

> + if (!pm_suspend_via_firmware())
> + return nvme_set_power(&ndev->ctrl, 0);

Don't we need to save the previous power state here and restore that?
For example if someone set a specific state through nvme-cli?
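
Something along these lines, as an untested sketch against a mock
controller. Everything here is hypothetical: sketch_ctrl and the
sketch_* helpers only model reading and writing the Power Management
feature, they are not the driver's API:

```c
/* Hypothetical mock of the controller state that matters here. */
struct sketch_ctrl {
	unsigned int npss;		/* highest valid power state index */
	unsigned int current_ps;	/* what the device is set to now */
	unsigned int saved_ps;		/* remembered across suspend */
};

/* Stand-in for a Get Features (Power Management) command. */
static int sketch_get_power(struct sketch_ctrl *ctrl, unsigned int *ps)
{
	*ps = ctrl->current_ps;
	return 0;
}

/* Stand-in for a Set Features (Power Management) command. */
static int sketch_set_power(struct sketch_ctrl *ctrl, unsigned int ps)
{
	if (ps > ctrl->npss)
		return -1;
	ctrl->current_ps = ps;
	return 0;
}

/* Remember the state in effect before suspend, then go to the target. */
static int sketch_suspend(struct sketch_ctrl *ctrl, unsigned int target_ps)
{
	int ret;

	ret = sketch_get_power(ctrl, &ctrl->saved_ps);
	if (ret)
		return ret;
	return sketch_set_power(ctrl, target_ps);
}

/* Restore whatever was set before suspend instead of hardcoding 0. */
static int sketch_resume(struct sketch_ctrl *ctrl)
{
	return sketch_set_power(ctrl, ctrl->saved_ps);
}
```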