[PATCH v3 0/8] vfio/pci: power management changes

From: Abhishek Sahu
Date: Mon Apr 25 2022 - 05:26:53 EST

Currently, there is very limited power management support available
in the upstream vfio-pci driver. If there is no user of a vfio-pci device,
it is moved into the D3hot state. Similarly, if runtime power
management is enabled for a vfio-pci device in the guest OS, then the
device is runtime suspended (for a Linux guest OS) and the PCI
device is put into the D3hot state (in function
vfio_pm_config_write()). Using the D3cold state instead of D3hot
would save the most power. The D3cold state cannot be reached with
native PCI PM alone. It requires interaction with platform
firmware, which is system-specific. To go into low power states
(including D3cold), the runtime PM framework can be used, which
internally interacts with PCI and platform firmware and puts the device
into the lowest possible D-state. This patch series registers the
vfio-pci driver with the runtime PM framework and uses it to move
the physical PCI device into the low power state.
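
The usage-count idea behind runtime PM can be illustrated with a small
userspace model. This is only a sketch under stated assumptions:
struct toy_dev, toy_pm_get() and toy_pm_put() are hypothetical names
that do not exist in the kernel, and the real driver would rely on the
kernel runtime PM helpers and on the PCI core to pick the actual D-state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is not kernel code. */
enum toy_power_state { TOY_D0, TOY_D3HOT, TOY_D3COLD };

struct toy_dev {
	int usage_count;		/* number of active users */
	bool d3cold_allowed;		/* models the platform/sysfs policy */
	enum toy_power_state state;	/* current modelled power state */
};

/* Take a reference; the first user resumes the device to D0. */
static void toy_pm_get(struct toy_dev *dev)
{
	if (dev->usage_count++ == 0)
		dev->state = TOY_D0;
}

/* Drop a reference; the last put enters the lowest allowed state. */
static void toy_pm_put(struct toy_dev *dev)
{
	assert(dev->usage_count > 0);
	if (--dev->usage_count == 0)
		dev->state = dev->d3cold_allowed ? TOY_D3COLD : TOY_D3HOT;
}
```

In this model the device stays in D0 while at least one reference is
held, and only the final toy_pm_put() lets it drop to D3cold (or D3hot
when D3cold is not allowed).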

The current PM support was added with commit 6eb7018705de ("vfio-pci:
Move idle devices to D3hot power state"), which made the following
point regarding the D3cold state:

"It's tempting to try to use D3cold, but we have no reason to inhibit
hotplug of idle devices and we might get into a loop of having the
device disappear before we have a chance to try to use it."

With runtime PM, if the user wants to prevent a device from going into
D3cold, then /sys/bus/pci/devices/.../d3cold_allowed can be set to 0
for the devices where the above behavior is required, instead of
disallowing the D3cold state in all cases.
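
As an illustration, the sysfs write could be done from a small userspace
helper like the one below. This is a sketch, not part of the series: the
function names are made up for this example, and the device address is
whatever name the device has under /sys/bus/pci/devices. Writing the
attribute normally requires root.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of the d3cold_allowed attribute for one device.
 * 'bdf' is the device name as listed under /sys/bus/pci/devices.
 * Returns 0 on success, -1 on a bad name or a too-small buffer.
 */
static int d3cold_allowed_path(const char *bdf, char *buf, size_t len)
{
	int n;

	assert(bdf && buf);
	if (strchr(bdf, '/'))	/* refuse anything that is not a plain name */
		return -1;
	n = snprintf(buf, len, "/sys/bus/pci/devices/%s/d3cold_allowed", bdf);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" or "0" to the attribute. Returns 0 on success, -1 on error. */
static int set_d3cold_allowed(const char *bdf, int allowed)
{
	char path[256];
	FILE *f;
	int ret;

	if (d3cold_allowed_path(bdf, path, sizeof(path)) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ret = fputs(allowed ? "1" : "0", f) < 0 ? -1 : 0;
	if (fclose(f) != 0)	/* sysfs write errors may only show up here */
		ret = -1;
	return ret;
}
```

Calling set_d3cold_allowed() with the device name and 0 corresponds to
the "set to 0" case described above; the helper only reports failure
and leaves any policy decision to the caller.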

Since the D3cold state can't be reached by writing the PCI standard PM
config registers, a feature has been added to the DEVICE_FEATURE IOCTL
for low power handling, which moves the PCI device from D3hot to
D3cold and then from D3cold to D0.
Hypervisors can implement virtual ACPI methods. For example,
in a Linux guest, if the PCI device's ACPI node has _PR3 and _PR0 power
resources with _ON/_OFF methods, then the guest makes the _OFF call
during the D3cold transition and the _ON call during the D0 transition.
The hypervisor can intercept these virtual ACPI calls and then issue
the D3cold related IOCTL to the vfio driver.
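
The mapping from guest ACPI calls to low power requests can be sketched
as a small state machine. Everything below is a hypothetical userspace
model: the enum and function names are invented for this example and
the code does not issue the real IOCTL. It assumes the guest has
already moved the device to D3hot before calling _OFF.

```c
#include <assert.h>

/* Hypothetical model types; none of these names come from the series. */
enum vdev_dstate { VDEV_D0, VDEV_D3HOT, VDEV_D3COLD };
enum guest_acpi_call { GUEST_ACPI_OFF, GUEST_ACPI_ON };
enum lp_request { LP_NONE, LP_ENTER, LP_EXIT };

/*
 * Decide which low power request (if any) the hypervisor would forward
 * when it intercepts a guest _OFF or _ON call on the power resource.
 */
static enum lp_request on_guest_acpi_call(enum vdev_dstate cur,
					  enum guest_acpi_call call)
{
	if (call == GUEST_ACPI_OFF && cur == VDEV_D3HOT)
		return LP_ENTER;	/* D3hot -> D3cold */
	if (call == GUEST_ACPI_ON && cur == VDEV_D3COLD)
		return LP_EXIT;		/* D3cold -> D0 */
	return LP_NONE;			/* nothing to forward */
}

/* Apply a request to the modelled state and return the new state. */
static enum vdev_dstate apply_lp_request(enum vdev_dstate cur,
					 enum lp_request req)
{
	switch (req) {
	case LP_ENTER:
		assert(cur == VDEV_D3HOT);	/* entry only from D3hot */
		return VDEV_D3COLD;
	case LP_EXIT:
		assert(cur == VDEV_D3COLD);	/* exit only from D3cold */
		return VDEV_D0;
	default:
		return cur;
	}
}
```

In a real hypervisor the LP_ENTER and LP_EXIT cases would be the points
where the D3cold related IOCTL is issued on the vfio device.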

BAR access needs to be disabled if the device is in the D3hot state.
Also, there should not be any config access if the device is in the
D3cold state. For SR-IOV, the PF power state should be higher than the
VF's power state.
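
These access rules can be written down as simple predicates. The code
below is a hypothetical model, not driver code: the names are invented,
BAR access is assumed to be allowed only in D0, and the SR-IOV rule is
read as "the PF is never in a deeper low power state than a VF", which
is one interpretation of the sentence above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; ordered from most powered to least powered. */
enum acc_dstate { ACC_D0, ACC_D3HOT, ACC_D3COLD };

/* BAR access: modelled as allowed only while the device is in D0. */
static bool acc_bar_allowed(enum acc_dstate s)
{
	assert(s <= ACC_D3COLD);
	return s == ACC_D0;
}

/* Config space access: possible in D0 and D3hot, but not in D3cold. */
static bool acc_config_allowed(enum acc_dstate s)
{
	assert(s <= ACC_D3COLD);
	return s != ACC_D3COLD;
}

/* SR-IOV ordering: the PF must not be in a deeper state than the VF. */
static bool acc_pf_vf_order_ok(enum acc_dstate pf, enum acc_dstate vf)
{
	return pf <= vf;
}
```

With this ordering, a PF in D0 is always acceptable, which matches the
patch that moves the PF to D0 before enabling VFs.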

* Changes in v3

- Rebased patches on v5.18-rc3.
- Marked this series as PATCH instead of RFC.
- Addressed the review comments given in v2.
- Removed the limitation of keeping the device in the D0 state if there
is any access from the host side. This is specific to the NVIDIA use
case and will be handled separately.
- Used the existing DEVICE_FEATURE IOCTL itself instead of adding a new
IOCTL for power management.
- Removed all custom code related to power management in the runtime
suspend/resume callbacks and IOCTL handling. Now, the callbacks
contain code related to INTx handling and a few other things, and
all the PCI state and platform PM handling is done by the PCI core
functions themselves.
- Added wake-up support in the main vfio layer itself, since we now have
more vfio/pci based drivers.
- Instead of assigning the 'struct dev_pm_ops' in each individual parent
driver, vfio_pci_core itself now assigns the 'struct dev_pm_ops'.
- Added handling of power management around SR-IOV handling.
- Moved the setting of drvdata in a separate patch.
- Masked INTx during the runtime suspended state.
- Changed the order of patches so that the fix-related patches are at
the beginning of this patch series.
- Removed storing the power state locally and used one new boolean to
track the D3 (D3cold and D3hot) power state.
- Removed check for IO access in D3 power state.
- Used another helper function vfio_lock_and_set_power_state() instead
of touching vfio_pci_set_power_state().
- Considered the fixes made in
and updated the patches accordingly.

* Changes in v2

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

* v1

Abhishek Sahu (8):
vfio/pci: Invalidate mmaps and block the access in D3hot power state
vfio/pci: Change the PF power state to D0 before enabling VFs
vfio/pci: Virtualize PME related registers bits and initialize to zero
vfio/pci: Add support for setting driver data inside core layer
vfio/pci: Enable runtime PM for vfio_pci_core based drivers
vfio: Invoke runtime PM API for IOCTL request
vfio/pci: Mask INTx during runtime suspend
vfio/pci: Add the support for PCI D3cold state

.../vfio/pci/hisilicon/hisi_acc_vfio_pci.c |   4 +-
drivers/vfio/pci/mlx5/main.c               |   3 +-
drivers/vfio/pci/vfio_pci.c                |   4 +-
drivers/vfio/pci/vfio_pci_config.c         |  63 ++-
drivers/vfio/pci/vfio_pci_core.c           | 358 +++++++++++++++---
drivers/vfio/pci/vfio_pci_intrs.c          |   6 +-
drivers/vfio/pci/vfio_pci_rdwr.c           |   6 +-
drivers/vfio/vfio.c                        |  44 ++-
include/linux/vfio_pci_core.h              |  12 +-
include/uapi/linux/vfio.h                  |  18 +
10 files changed, 445 insertions(+), 73 deletions(-)