[RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

From: abhsahu
Date: Mon Nov 15 2021 - 08:37:40 EST


From: Abhishek Sahu <abhsahu@xxxxxxxxxx>

Currently, if the runtime power management is enabled for vfio-pci
device in guest OS, then guest OS will do the register write for
PCI_PM_CTRL register. This write request will be handled in
vfio_pm_config_write() where it will do the actual register write of
PCI_PM_CTRL register. With this, the maximum D3hot state can be
achieved for low power. If we can use the runtime PM framework, then
we can achieve the D3cold state which will help in saving
maximum power.

This patch uses runtime PM framework whenever vfio-pci device will
be put in the low power state.

1. If runtime PM is enabled, then instead of directly writing
PCI_PM_CTRL register, decrement the device usage counter whenever
the power state is non-D0. The kernel runtime PM framework will
itself put the PCI device in low power state when device usage
counter will become zero. Similarly, when the power state will be
again changed back to D0, then increment the device usage counter
and the kernel runtime PM framework will itself bring the PCI device
out of low power state.

2. The guest OS will read the PCI_PM_CTRL register back to
confirm the current power state so virtual register bits can be
used. For this, before decrementing the usage count, read the actual
PCI_PM_CTRL register, update the power state related bits, and then
update the vconfig bits corresponding to PCI_PM_CTRL offset. For
PCI_PM_CTRL register read, return the virtual value if runtime PM is
requested. This vconfig bits will be cleared when the power state
will be changed back to D0.

3. For the guest OS, the PCI power state will be still D3hot
even if put the actual PCI device into D3cold state. In the D3hot
state, the config space can still be read. So, if there is any request
from guest OS to read the config space, then we need to move the actual
PCI device again back to D0. For this, increment the device usage
count before reading/writing the config space and then decrement it
again after reading/writing the config space. There can be
back-to-back config register read/write request, but since the auto
suspend methods are being used here so only first access will
wake up the PCI device and for the remaining access, the device will
already be active.

4. This above-mentioned wake up is not needed if the register
read/write is done completely with virtual bits. For handling this
condition, the actual resume of device will only being done in
vfio_user_config_read()/vfio_user_config_write(). All the config
register access will come vfio_pci_config_rw(). So, in this
function, use the pm_runtime_get_noresume() so that only usage count
will be incremented without resuming the device. Inside,
vfio_user_config_read()/vfio_user_config_write(), use the routines
with pm_runtime_put_noidle() so that the PCI device won’t be
suspended in the lower level functions. Again in the top level
vfio_pci_config_rw(), use the pm_runtime_put_autosuspend(). Now the
auto suspend timer will be started and the device will be suspended
again. If the device is already runtime suspended, then this routine
will return early.

5. In the host side D3cold will only be used if the platform has
support for this, otherwise some other state will be used. The
config space can be read if the device is not in D3cold state. So in
this case, we can skip the resuming of PCI device. The wrapper
function vfio_pci_config_pm_runtime_get() takes care of this
condition and invoke the pm_runtime_resume() only if current power
state is D3cold.

6. For vfio_pci_config_pm_runtime_get()/vfio_
pci_config_pm_runtime_put(), the reference code is taken from
pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then it
is modified according to vfio-pci driver requirement.

7. vfio_pci_set_power_state() will be unused after moving to runtime
PM, so this function can be removed along with other related
functions and structure fields.

Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
---
drivers/vfio/pci/vfio_pci_config.c | 178 ++++++++++++++++++++++++++---
drivers/vfio/pci/vfio_pci_core.c | 64 ++---------
include/linux/vfio_pci_core.h | 3 +-
3 files changed, 174 insertions(+), 71 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index fb3a503a5b99..5576eb4308b4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -25,6 +25,7 @@
#include <linux/uaccess.h>
#include <linux/vfio.h>
#include <linux/slab.h>
+#include <linux/pm_runtime.h>

#include <linux/vfio_pci_core.h>

@@ -119,12 +120,51 @@ struct perm_bits {
#define NO_WRITE 0
#define ALL_WRITE 0xFFFFFFFFU

-static int vfio_user_config_read(struct pci_dev *pdev, int offset,
+static void vfio_pci_config_pm_runtime_get(struct pci_dev *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct device *parent = dev->parent;
+
+ if (parent)
+ pm_runtime_get_sync(parent);
+
+ pm_runtime_get_noresume(dev);
+ /*
+ * pdev->current_state is set to PCI_D3cold during suspending,
+ * so wait until suspending completes
+ */
+ pm_runtime_barrier(dev);
+ /*
+ * Only need to resume devices in D3cold, because config
+ * registers are still accessible for devices suspended but
+ * not in D3cold.
+ */
+ if (pdev->current_state == PCI_D3cold)
+ pm_runtime_resume(dev);
+}
+
+static void vfio_pci_config_pm_runtime_put(struct pci_dev *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct device *parent = dev->parent;
+
+ pm_runtime_mark_last_busy(dev);
+ pm_runtime_put_noidle(dev);
+
+ if (parent)
+ pm_runtime_put_noidle(parent);
+}
+
+static int vfio_user_config_read(struct vfio_pci_core_device *vdev, int offset,
__le32 *val, int count)
{
+ struct pci_dev *pdev = vdev->pdev;
int ret = -EINVAL;
u32 tmp_val = 0;

+ if (vdev->pm_runtime_suspended)
+ vfio_pci_config_pm_runtime_get(pdev);
+
switch (count) {
case 1:
{
@@ -147,15 +187,22 @@ static int vfio_user_config_read(struct pci_dev *pdev, int offset,

*val = cpu_to_le32(tmp_val);

+ if (vdev->pm_runtime_suspended)
+ vfio_pci_config_pm_runtime_put(pdev);
+
return ret;
}

-static int vfio_user_config_write(struct pci_dev *pdev, int offset,
+static int vfio_user_config_write(struct vfio_pci_core_device *vdev, int offset,
__le32 val, int count)
{
+ struct pci_dev *pdev = vdev->pdev;
int ret = -EINVAL;
u32 tmp_val = le32_to_cpu(val);

+ if (vdev->pm_runtime_suspended)
+ vfio_pci_config_pm_runtime_get(pdev);
+
switch (count) {
case 1:
ret = pci_user_write_config_byte(pdev, offset, tmp_val);
@@ -168,6 +215,9 @@ static int vfio_user_config_write(struct pci_dev *pdev, int offset,
break;
}

+ if (vdev->pm_runtime_suspended)
+ vfio_pci_config_pm_runtime_put(pdev);
+
return ret;
}

@@ -183,11 +233,10 @@ static int vfio_default_config_read(struct vfio_pci_core_device *vdev, int pos,

/* Any non-virtualized bits? */
if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) {
- struct pci_dev *pdev = vdev->pdev;
__le32 phys_val = 0;
int ret;

- ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+ ret = vfio_user_config_read(vdev, pos, &phys_val, count);
if (ret)
return ret;

@@ -224,18 +273,17 @@ static int vfio_default_config_write(struct vfio_pci_core_device *vdev, int pos,

/* Non-virtualzed and writable bits go to hardware */
if (write & ~virt) {
- struct pci_dev *pdev = vdev->pdev;
__le32 phys_val = 0;
int ret;

- ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+ ret = vfio_user_config_read(vdev, pos, &phys_val, count);
if (ret)
return ret;

phys_val &= ~(write & ~virt);
phys_val |= (val & (write & ~virt));

- ret = vfio_user_config_write(pdev, pos, phys_val, count);
+ ret = vfio_user_config_write(vdev, pos, phys_val, count);
if (ret)
return ret;
}
@@ -250,7 +298,7 @@ static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos,
{
int ret;

- ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+ ret = vfio_user_config_read(vdev, pos, val, count);
if (ret)
return ret;

@@ -275,7 +323,7 @@ static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
{
int ret;

- ret = vfio_user_config_write(vdev->pdev, pos, val, count);
+ ret = vfio_user_config_write(vdev, pos, val, count);
if (ret)
return ret;

@@ -288,7 +336,7 @@ static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
{
int ret;

- ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+ ret = vfio_user_config_read(vdev, pos, val, count);
if (ret)
return ret;

@@ -692,13 +740,86 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
return 0;
}

+static int vfio_perform_runtime_pm(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 val, pci_power_t state)
+{
+ /*
+ * If runtime PM is enabled, then instead of directly writing
+ * PCI_PM_CTRL register, decrement the device usage counter whenever
+ * the power state is non-D0. The kernel runtime PM framework will
+ * itself put the PCI device in the low power state when device usage
+ * counter will become zero. The guest OS will read the PCI_PM_CTRL
+ * register back to confirm the current power state so virtual
+ * register bits can be used. For this, read the actual PCI_PM_CTRL
+ * register, update the power state related bits and then update the
+ * vconfig bits corresponding to PCI_PM_CTRL offset. If the
+ * pm_runtime_suspended status is set, then return the virtual
+ * register value for PCI_PM_CTRL register read. All the bits
+ * in PCI_PM_CTRL are being returned from virtual config, so that
+ * this register read will not wake up the PCI device from suspended
+ * state.
+ *
+ * Once power state will be changed back to D0, then clear the power
+ * state related bits in vconfig. After this, increment the device
+ * usage counter which will internally wake up the PCI device from
+ * suspended state.
+ */
+ if (state != PCI_D0 && !vdev->pm_runtime_suspended) {
+ __le32 virt_val = 0;
+
+ count = vfio_default_config_write(vdev, pos, count, perm,
+ offset, val);
+ if (count < 0)
+ return count;
+
+ vfio_default_config_read(vdev, pos, 4, perm, offset, &virt_val);
+ virt_val &= ~cpu_to_le32(PCI_PM_CTRL_STATE_MASK);
+ virt_val |= (val & cpu_to_le32(PCI_PM_CTRL_STATE_MASK));
+ memcpy(vdev->vconfig + pos, &virt_val, 4);
+ vdev->pm_runtime_suspended = true;
+ pm_runtime_mark_last_busy(&vdev->pdev->dev);
+ pm_runtime_put_autosuspend(&vdev->pdev->dev);
+ return count;
+ }
+
+ if (vdev->pm_runtime_suspended && state == PCI_D0) {
+ vdev->pm_runtime_suspended = false;
+ *(__le16 *)&vdev->vconfig[pos] &=
+ ~cpu_to_le16(PCI_PM_CTRL_STATE_MASK);
+ pm_runtime_get_sync(&vdev->pdev->dev);
+ }
+
+ return vfio_default_config_write(vdev, pos, count, perm, offset, val);
+}
+
+static int vfio_pm_config_read(struct vfio_pci_core_device *vdev, int pos,
+ int count, struct perm_bits *perm,
+ int offset, __le32 *val)
+{
+ /*
+ * If pm_runtime_suspended status is set, then return the virtual
+ * register value for PCI_PM_CTRL register read.
+ */
+ if (vdev->pm_runtime_suspended &&
+ offset >= PCI_PM_CTRL && offset < (PCI_PM_CTRL + 4)) {
+ memcpy(val, vdev->vconfig + pos, count);
+ return count;
+ }
+
+ return vfio_default_config_read(vdev, pos, count, perm, offset, val);
+}
+
static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
int count, struct perm_bits *perm,
int offset, __le32 val)
{
- count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
- if (count < 0)
- return count;
+ if (offset != PCI_PM_CTRL) {
+ count = vfio_default_config_write(vdev, pos, count, perm,
+ offset, val);
+ if (count < 0)
+ return count;
+ }

if (offset == PCI_PM_CTRL) {
pci_power_t state;
@@ -718,7 +839,8 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
break;
}

- vfio_pci_set_power_state(vdev, state);
+ count = vfio_perform_runtime_pm(vdev, pos, count, perm,
+ offset, val, state);
}

return count;
@@ -731,6 +853,7 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
return -ENOMEM;

perm->writefn = vfio_pm_config_write;
+ perm->readfn = vfio_pm_config_read;

/*
* We always virtualize the next field so we can remove
@@ -1921,13 +2044,31 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
size_t done = 0;
int ret = 0;
loff_t pos = *ppos;
+ bool runtime_put_required = false;

pos &= VFIO_PCI_OFFSET_MASK;

+ /*
+ * For virtualized bits read/write, the device should not be resumed.
+ * Increment the device usage count alone so that the device won't be
+ * runtime suspended during config read/write.
+ */
+ if (vdev->pm_runtime_suspended) {
+ pm_runtime_get_noresume(&vdev->pdev->dev);
+ runtime_put_required = true;
+ }
+
while (count) {
ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
- if (ret < 0)
+ if (ret < 0) {
+ /*
+ * Decrement the device usage counter corresponding to
+ * previous pm_runtime_get_noresume().
+ */
+ if (runtime_put_required)
+ pm_runtime_put_autosuspend(&vdev->pdev->dev);
return ret;
+ }

count -= ret;
done += ret;
@@ -1937,5 +2078,12 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,

*ppos += done;

+ /*
+ * Decrement the device usage counter corresponding to previous
+ * pm_runtime_get_noresume().
+ */
+ if (runtime_put_required)
+ pm_runtime_put_autosuspend(&vdev->pdev->dev);
+
return done;
}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 511a52e92b32..79fa86914b6c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -187,57 +187,6 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
return false;
}

-static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
-{
- struct pci_dev *pdev = vdev->pdev;
- u16 pmcsr;
-
- if (!pdev->pm_cap)
- return;
-
- pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
-
- vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
-}
-
-/*
- * pci_set_power_state() wrapper handling devices which perform a soft reset on
- * D3->D0 transition. Save state prior to D0/1/2->D3, stash it on the vdev,
- * restore when returned to D0. Saved separately from pci_saved_state for use
- * by PM capability emulation and separately from pci_dev internal saved state
- * to avoid it being overwritten and consumed around other resets.
- */
-int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
-{
- struct pci_dev *pdev = vdev->pdev;
- bool needs_restore = false, needs_save = false;
- int ret;
-
- if (vdev->needs_pm_restore) {
- if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
- pci_save_state(pdev);
- needs_save = true;
- }
-
- if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
- needs_restore = true;
- }
-
- ret = pci_set_power_state(pdev, state);
-
- if (!ret) {
- /* D3 might be unsupported via quirk, skip unless in D3 */
- if (needs_save && pdev->current_state >= PCI_D3hot) {
- vdev->pm_save = pci_store_saved_state(pdev);
- } else if (needs_restore) {
- pci_load_and_free_saved_state(pdev, &vdev->pm_save);
- pci_restore_state(pdev);
- }
- }
-
- return ret;
-}
-
int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
{
struct pci_dev *pdev = vdev->pdev;
@@ -323,6 +272,16 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
/* For needs_reset */
lockdep_assert_held(&vdev->vdev.dev_set->lock);

+ /*
+ * The vfio device user can close the device after putting the device
+ * into runtime suspended state so wake up the device first in
+ * this case.
+ */
+ if (vdev->pm_runtime_suspended) {
+ vdev->pm_runtime_suspended = false;
+ pm_runtime_get_sync(&pdev->dev);
+ }
+
/* Stop the device from further DMA */
pci_clear_master(pdev);

@@ -1809,7 +1768,6 @@ void vfio_pci_core_uninit_device(struct vfio_pci_core_device *vdev)
mutex_destroy(&vdev->vma_lock);
vfio_uninit_group_dev(&vdev->vdev);
kfree(vdev->region);
- kfree(vdev->pm_save);
}
EXPORT_SYMBOL_GPL(vfio_pci_core_uninit_device);

@@ -1855,8 +1813,6 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
if (ret)
goto out_vf;

- vfio_pci_probe_power_state(vdev);
-
/*
* pci-core sets the device power state to an unknown value at
* bootup and after being removed from a driver. The only
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index aafe09c9fa64..2b1a556ce73f 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -123,9 +123,8 @@ struct vfio_pci_core_device {
bool has_vga;
bool needs_reset;
bool nointx;
- bool needs_pm_restore;
+ bool pm_runtime_suspended;
struct pci_saved_state *pci_saved_state;
- struct pci_saved_state *pm_save;
int ioeventfds_nr;
struct eventfd_ctx *err_trigger;
struct eventfd_ctx *req_trigger;
--
2.17.1