Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

From: Alex Williamson
Date: Fri Nov 19 2021 - 10:45:56 EST


On Thu, 18 Nov 2021 14:09:13 -0700
Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:

> On Thu, 18 Nov 2021 20:51:41 +0530
> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
> > On 11/17/2021 11:23 PM, Alex Williamson wrote:
> > Thanks Alex for checking this series and providing your inputs.
> >
> > > If we're transitioning a device to D3cold rather than D3hot as
> > > requested by userspace, isn't that a user visible change?
> >
> > For most of the driver, in linux kernel, the D3hot vs D3cold
> > state will be decided at PCI core layer. In the PCI core layer,
> > pci_target_state() determines which D3 state to choose. It checks
> > for platform_pci_power_manageable() and then it calls
> > platform_pci_choose_state() to find the target state.
> > In VM, the platform_pci_power_manageable() check will fail if the
> > guest is linux OS. So, it uses, D3hot state.
>
> Right, but my statement is really more that the device PM registers
> cannot be used to put the device into D3cold, so the write of the PM
> register that we're trapping was the user/guest's intention to put the
> device into D3hot. We therefore need to be careful about differences
> in the resulting device state when it comes out of D3cold vs D3hot.
>
> > But there are few drivers which does not use the PCI framework
> > generic power related routines during runtime suspend/system suspend
> > and set the PCI power state directly with D3hot.
>
> Current vfio-pci being one of those ;)
>
> > Also, the guest can be non-Linux OS also and, in that case,
> > it will be difficult to know the behavior. So, it may impact
> > these cases.
>
> That's what I'm worried about.
>
> > > For instance, a device may report NoSoftRst- indicating that the device
> > > does not do a soft reset on D3hot->D0 transition. If we're instead
> > > putting the device in D3cold, then a transition back to D0 has very
> > > much undergone a reset. On one hand we should at least virtualize the
> > > NoSoftRst bit to allow the guest to restore the device, but I wonder if
> > > that's really safe. Is a better option to prevent entering D3cold if
> > > the device isn't natively reporting NoSoftRst-?
> > >
> >
> > You mean to say NoSoftRst+ instead of NoSoftRst- as visible in
>
> Oops yes. The concern is if the user/guest is not expecting a soft
> reset when using D3hot, but we transparently promote D3hot to D3cold
> which will always implies a device reset.
>
> > the lspci output. For NoSoftRst- case, we do a soft reset on
> > D3hot->D0 transition. But, will this case not be handled internally
> > in drivers/pci/pci-driver.c ? For both system suspend and runtime suspend,
> > we check for pci_dev->state_saved flag and do pci_save_state()
> > irrespective of NoSoftRst bit. For NoSoftRst- case, pci_restore_bars()
> > will be called in pci_raw_set_power_state() which will reinitialize device
> > for D3hot/D3cold-> D0 case. Once the device is initialized in the host,
> > then for guest, it should work without re-initializing again in the
> > guest side. I am not sure, if my understanding is correct.
>
> The soft reset is not limited to the state that the PCI subsystem can
> save and restore. Device specific state that the user/guest may
> legitimately expect to be retained may be reset as well.
>
> [PCIe v5 5.3.1.4]
> Functional context is required to be maintained by Functions in
> the D3 hot state if the No_Soft_Reset field in the PMCSR is Set.
>
> Unfortunately I don't see a specific definition of "functional
> context", but I interpret that to include device specific state. For
> example, if a GPU contains specific frame buffer data and reports
> NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data
> to be retained on D3hot->D0 transition?
>
> > > We're also essentially making a policy decision on behalf of
> > > userspace that favors power saving over latency. Is that
> > > universally the correct trade-off?
> >
> > For most drivers, the D3hot vs D3cold should not be favored due
> > to latency reasons. In the linux kernel side, I am seeing, the
> > PCI framework try to use D3cold state if platform and device
> > supports that. But its correct that covertly replacing D3hot with
> > D3cold may be concern for some drivers.
> >
> > > I can imagine this could be desirable for many use cases,
> > > but if we're going to covertly replace D3hot with D3cold, it seems
> > > like there should be an opt-in. Is co-opting the PM capability for
> > > this even really acceptable or should there be a device ioctl to
> > > request D3cold and plumbing through QEMU such that a VM guest can
> > > make informed choices regarding device power management?
> > >
> >
> > Making IOCTL is also an option but that case, this support needs to
> > be added in all hypervisors and user must pass this information
> > explicitly for each device. Another option could be to use
> > module parameter to explicitly enable D3cold support. If module
> > parameter is not set, then we can call pci_d3cold_disable() and
> > in that case, runtime PM should not use D3cold state.
> >
> > Also, I was checking we can pass this information though some
> > virtualized register bit which will be only defined for passing
> > the information between guest and host. In the guest side if the
> > target state is being decided with pci_target_state(), then
> > the D3cold vs D3hot should not matter for the driver running
> > in the guest side and in that case, it depends upon platform support.
> > We can set this virtualize bit to 1. But, if driver is either
> > setting D3hot state explicitly or has called pci_d3cold_disable() or
> > similar API available in the guest OS, then set this bit to 0 and
> > in that case, the D3cold state can be disabled in the host side.
> > But don't know if is possible to use some non PCI defined
> > virtualized register bit.
>
> If you're suggesting a device config space register, that's troublesome
> because we can't guarantee that simply because a range of config space
> isn't within a capability that it doesn't have some device specific
> purpose. However, we could certainly implement virtual registers in
> the hypervisor that implement the ACPI support that an OS would use on
> bare metal to implement D3cold. Those could trigger this ioctl through
> the vfio device.
>
> > I am not sure what should be best option to make choice
> > regarding d3cold but if we can have some option by which this
> > can be done without involvement of user, then it will benefit
> > for lot of cases. Currently, the D3cold is supported only in
> > very few desktops/servers but in future, we will see on
> > most of the platforms.
>
> I tend to see it as an interesting hack to promote D3hot to D3cold, and
> potentially very useful. However, we're also introducing potentially
> unexpected device behavior, so I think it would probably need to be an
> opt-in. Possibly if the device reports NoSoftRst- we could use it by
> default, but even virtualizing the NoSoftRst suggests that there's an
> expectation that the guest driver has that support available.
>
> > > Also if the device is not responsive to config space due to the user
> > > placing it in D3 now, I'd expect there are other ioctl paths that
> > > need to be blocked, maybe even MMIO paths that might be a gap for
> > > existing D3hot support. Thanks,
> >
> > I was in assumption that most of IOCTL code will be called by the
> > hypervisor before guest OS boot and during that time, the device
> > will be always in D0. But, if we have paths where IOCTL can be
> > called when the device has been suspended by guest OS, then can we
> > use runtime_get/put API’s there also ?
>
> It's more a matter of preventing user actions that can cause harm
> rather than expecting certain operations only in specific states. We
> could chose to either resume the device for those operations or fail
> the operation. We should probably also leverage the memory-disable
> support to fault mmap access to MMIO when the device is in D3* as well.

It also occurred to me last night that a guest triggering D3hot via the
PM registers must be a synchronous power state change, we can't use
auto-suspend. This is necessary for nested assignment where the guest
might use a D3hot->D0 power state transition with NoSoftRst- devices in
order to perform a reset of the device. With auto-suspend, the guest
would return the device to D0 before the physical device ever timed out
to enter a D3 state. Thanks,

Alex