Re: [RFC PATCH v4 07/10] vfio/pci: introduce a new irq type VFIO_IRQ_TYPE_REMAP_BAR_REGION

From: Yan Zhao
Date: Sun Jun 21 2020 - 23:44:21 EST


On Fri, Jun 19, 2020 at 04:55:34PM -0600, Alex Williamson wrote:
> On Wed, 10 Jun 2020 01:23:14 -0400
> Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
>
> > On Fri, Jun 05, 2020 at 10:13:01AM -0600, Alex Williamson wrote:
> > > On Thu, 4 Jun 2020 22:02:31 -0400
> > > Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > >
> > > > On Wed, Jun 03, 2020 at 10:10:58PM -0600, Alex Williamson wrote:
> > > > > On Wed, 3 Jun 2020 22:42:28 -0400
> > > > > Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > > > >
> > > > > > On Wed, Jun 03, 2020 at 05:04:52PM -0600, Alex Williamson wrote:
> > > > > > > On Tue, 2 Jun 2020 21:40:58 -0400
> > > > > > > Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > On Tue, Jun 02, 2020 at 01:34:35PM -0600, Alex Williamson wrote:
> > > > > > > > > I'm not at all happy with this. Why do we need to hide the migration
> > > > > > > > > sparse mmap from the user until migration time? What if instead we
> > > > > > > > > introduced a new VFIO_REGION_INFO_CAP_SPARSE_MMAP_SAVING capability
> > > > > > > > > where the existing capability is the normal runtime sparse setup and
> > > > > > > > > the user is required to use this new one prior to enabling device_state
> > > > > > > > > with _SAVING. The vendor driver could then simply track mmap vmas to
> > > > > > > > > the region and refuse to change device_state if there are outstanding
> > > > > > > > > mmaps conflicting with the _SAVING sparse mmap layout. No new IRQs
> > > > > > > > > required, no new irqfds, an incremental change to the protocol,
> > > > > > > > > backwards compatible to the extent that a vendor driver requiring this
> > > > > > > > > will automatically fail migration.
> > > > > > > > >
> > > > > > > > Right, it looks like we need to use this approach to solve the problem.
> > > > > > > > Thanks for your guidance.
> > > > > > > > So I'll abandon the current remap IRQ approach to dirty tracking during
> > > > > > > > live migration.
> > > > > > > > Anyway, it does demonstrate how to customize irq_types in vendor drivers.
> > > > > > > > What do you think about patches 1-5, then?
> > > > > > >
> > > > > > > In broad strokes, I don't think we've found the right solution yet. I
> > > > > > > really question whether it's supportable to parcel out vfio-pci like
> > > > > > > this and I don't know how I'd support unraveling whether we have a bug
> > > > > > > in vfio-pci, the vendor driver, or how the vendor driver is making use
> > > > > > > of vfio-pci.
> > > > > > >
> > > > > > > Let me also ask, why does any of this need to be in the kernel? We
> > > > > > > spend 5 patches slicing up vfio-pci so that we can register a vendor
> > > > > > > driver and have that vendor driver call into vfio-pci as it sees fit.
> > > > > > > We have two patches creating device specific interrupts and a BAR
> > > > > > > remapping scheme that we've decided we don't need. That brings us to
> > > > > > > the actual i40e vendor driver, where the first patch is simply making
> > > > > > > the vendor driver work like vfio-pci already does, the second patch is
> > > > > > > handling the migration region, and the third patch is implementing the
> > > > > > > BAR remapping IRQ that we decided we don't need. It's difficult to
> > > > > > > actually find the small bit of code that's required to support
> > > > > > > migration outside of just dealing with the protocol we've defined to
> > > > > > > expose this from the kernel. So why are we trying to do this in the
> > > > > > > kernel? We have quirk support in QEMU, we can easily flip
> > > > > > > MemoryRegions on and off, etc. What access to the device outside of
> > > > > > > what vfio-pci provides to the user, and therefore QEMU, is necessary to
> > > > > > > implement this migration support for i40e VFs? Is this just an
> > > > > > > exercise in making use of the migration interface? Thanks,
> > > > > > >
> > > > > > hi Alex
> > > > > >
> > > > > > The intention of this series was described in RFC v1
> > > > > > (https://www.spinics.net/lists/kernel/msg3337337.html).
> > > > > > Sorry, I have not included it since RFC v2.
> > > > > >
> > > > > > "
> > > > > > The reason why we don't choose the way of writing mdev parent driver is
> > > > > > that
> > > > >
> > > > > I didn't mention an mdev approach, I'm asking what are we accomplishing
> > > > > by doing this in the kernel at all versus exposing the device as normal
> > > > > through vfio-pci and providing the migration support in QEMU. Are you
> > > > > actually leveraging having some sort of access to the PF in supporting
> > > > > migration of the VF? Is vfio-pci masking the device in a way that
> > > > > prevents migrating the state from QEMU?
> > > > >
> > > > Yes, communication with the PF is required. VF state is managed by the
> > > > PF and is queried from the PF when the VF is stopped.
> > > >
> > > > Migration support in QEMU seems suitable only for devices whose dirty
> > > > pages and device state are available by reading/writing device MMIO,
> > > > which is not the case for most devices.
> > >
> > > Post code for such a device.
> > >
> > hi Alex,
> > There's an example in the i40e VF. Virtual channel related resources are
> > in guest memory, and dirty page tracking requires the info stored in that
> > guest memory.
> >
> > There are two ways to get the addresses of those resources:
> > (1) Always trap the related VF registers, as in Alex Graf's QEMU code.
> >
> > From the beginning, it tracks reads/writes of the Admin Queue Configuration
> > registers. Then in the write handler vfio_i40evf_aq_mmio_mem_region_write(),
> > guest commands are processed to record the guest DMA addresses of the
> > virtual channel related resources.
> > e.g. vdev->vsi_config is read from the guest DMA address contained in
> > command I40E_VIRTCHNL_OP_CONFIG_VSI_QUEUES.
> >
> >
> > vfio_i40evf_initfn()
> > {
> >     ...
> >     memory_region_init_io(&vdev->aq_mmio_mem, OBJECT(dev),
> >                           &vfio_i40evf_aq_mmio_mem_region_ops,
> >                           vdev, "i40evf AQ config",
> >                           I40E_VFGEN_RSTAT - I40E_VF_ARQBAH1);
> >     ...
> > }
> >
> > vfio_i40evf_aq_mmio_mem_region_write()
> > {
> >     ...
> >     switch (addr) {
> >     case I40E_VF_ARQBAH1:
> >     case I40E_VF_ARQBAL1:
> >     case I40E_VF_ARQH1:
> >     case I40E_VF_ARQLEN1:
> >     case I40E_VF_ARQT1:
> >     case I40E_VF_ATQBAH1:
> >     case I40E_VF_ATQBAL1:
> >     case I40E_VF_ATQH1:
> >     case I40E_VF_ATQT1:
> >     case I40E_VF_ATQLEN1:
> >         vfio_i40evf_vw32(vdev, addr, data);
> >         vfio_i40e_aq_update(vdev); /* update & process atq commands */
> >         break;
> >     default:
> >         vfio_i40evf_w32(vdev, addr, data);
> >         break;
> >     }
> > }
> > vfio_i40e_aq_update(vdev)
> >  |->vfio_i40e_atq_process_one(vdev, vfio_i40evf_vr32(vdev, I40E_VF_ATQH1))
> >      |-> hwaddr addr = vfio_i40e_get_atqba(vdev) + (index * sizeof(desc));
> >      |   pci_dma_read(pdev, addr, &desc, sizeof(desc)); //read guest's command
> >      |   vfio_i40e_record_atq_cmd(vdev, pdev, &desc)
> >
> >
> >
> > vfio_i40e_record_atq_cmd(...I40eAdminQueueDescriptor *desc) {
> >     data_addr = desc->params.external.addr_high;
> >     ...
> >
> >     switch (desc->cookie_high) {
> >     ...
> >     case I40E_VIRTCHNL_OP_CONFIG_VSI_QUEUES:
> >         pci_dma_read(pdev, data_addr, &vdev->vsi_config,
> >                      MIN(desc->datalen, sizeof(vdev->vsi_config)));
> >         ...
> >     }
> >     ...
> > }
> >
> >
> > (2) Pass through all guest MMIO accesses and only trap MMIO when migration
> > is about to start.
> > This is the approach we're using in the host vfio-pci vendor driver (or
> > mdev parent driver) for the i40e VF device (sorry, there is still no
> > public code available).
> >
> > When migration is about to start, it's already too late to get the guest
> > DMA addresses of those virtual channel related resources merely by MMIO
> > trapping, so we have to ask the PF for them.
> >
> >
> >
> > <...>
> >
> > > > > > The interfaces exported in patches 3/10-5/10 need to be exported
> > > > > > anyway, so that mdev parent drivers which pass through devices at
> > > > > > normal time can avoid duplication. and yes, your worry about
> > > > >
> > > > > Where are those parent drivers? What are their actual requirements?
> > > > >
> > > > If this way of registering vendor ops to vfio-pci is not permitted,
> > > > vendors have to resort to writing their own mdev parent drivers for VFs.
> > > > Those parent drivers need to pass through VFs at normal time, doing
> > > > exactly what vfio-pci does, and only do what the vendor ops do during
> > > > migration.
> > > >
> > > > If vfio-pci could export common code to those parent drivers, a lot of
> > > > duplicated code could be avoided.
> > >
> > > There are two sides to this argument though. We could also argue that
> > > mdev has already made it too easy to implement device emulation in the
> > > kernel, the barrier is that such emulation is more transparent because
> > > it does require a fair bit of code duplication from vfio-pci. If we
> > > make it easier to simply re-use vfio-pci for much of this, and even
> > > take it a step further by allowing vendor drivers to masquerade behind
> > > vfio-pci, then we're creating an environment where vendors don't need
> > > to work with QEMU to get their device emulation accepted. They can
> > > write their own vendor drivers, which are now simplified and sanctioned
> > > by exported functions in vfio-pci. They can do this easily and open up
> > > massive attack vectors, hiding behind vfio-pci.
> > >
> > your concern is reasonable.
> >
> > > I know that I was advocating avoiding user driver confusion, ie. does
> > > the user bind a device to vfio-pci, i40e_vf_vfio, etc, but maybe that's
> > > the barrier we need such that a user can make an informed decision
> > > about what they're actually using. If a vendor then wants to implement
> > > a feature in vfio-pci, we'll need to architect an interface for it
> > > rather than letting them pick and choose which pieces of vfio-pci to
> > > override.
> > >
> > > > > > identification of bug sources is reasonable. But if a device is bound
> > > > > > to vfio-pci with a vendor module loaded and there's a bug, there are at
> > > > > > least two ways to identify whether the bug is in vfio-pci itself:
> > > > > > (1) prevent vendor modules from loading and see if the problem exists
> > > > > > with pure vfio-pci.
> > > > > > (2) do what's demoed in patch 8/10, i.e. do nothing but simply pass all
> > > > > > operations to vfio-pci.
> > > > >
> > > > > The code split is still extremely ad-hoc, there's no API. An mdev
> > > > > driver isn't even a sub-driver of vfio-pci like you're trying to
> > > > > accomplish here, there would need to be a much more defined API when
> > > > > the base device isn't even a vfio_pci_device. I don't see how this
> > > > > series would directly enable an mdev use case.
> > > > >
> > > > Similar to Yi's series https://patchwork.kernel.org/patch/11320841/,
> > > > we can parcel out the vdev creation code in vfio_pci_probe() so that it
> > > > can be called from the mdev parent probe routine. (Of course, we also
> > > > need to parcel out the code that frees the vdev.)
> > > > e.g.
> > > >
> > > > void *vfio_pci_alloc_vdev(struct pci_dev *pdev, const struct pci_device_id *id)
> > > > {
> > > >     struct vfio_pci_device *vdev;
> > > >
> > > >     vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > > >     if (!vdev)
> > > >         return NULL;    /* caller treats NULL as -ENOMEM */
> > > >
> > > >     vdev->pdev = pdev;
> > > >     vdev->irq_type = VFIO_PCI_NUM_IRQS;
> > > >     mutex_init(&vdev->igate);
> > > >     spin_lock_init(&vdev->irqlock);
> > > >     mutex_init(&vdev->ioeventfds_lock);
> > > >     INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > >     ...
> > > >     vfio_pci_probe_power_state(vdev);
> > > >
> > > >     if (!disable_idle_d3) {
> > > >         vfio_pci_set_power_state(vdev, PCI_D0);
> > > >         vfio_pci_set_power_state(vdev, PCI_D3hot);
> > > >     }
> > > >     return vdev;
> > > > }
> > > >
> > > > static int vfio_mdev_pci_driver_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > > > {
> > > >     void *vdev = vfio_pci_alloc_vdev(pdev, id);
> > > >
> > > >     if (!vdev)
> > > >         return -ENOMEM;
> > > >
> > > >     //save the vdev pointer
> > > >
> > > >     return 0;
> > > > }
> > > > Then all the interfaces exported by this series can also benefit the
> > > > mdev use case.
> > >
> > > You need to convince me that we're not just doing this for the sake of
> > > re-using a migration interface. We do need vendor specific drivers to
> > > support migration, but implementing those vendor specific drivers in
> > > the kernel just because we have that interface is the wrong answer. If
> > > we can implement that device specific migration support in QEMU and
> > > limit the attack surface from the hypervisor or guest into the host
> > > kernel, that's a better answer. As I've noted above, I'm afraid all of
> > > these attempts to parcel out vfio-pci are only going to serve to
> > > proliferate vendor modules that have limited community review, expand
> > > the attack surface, and potentially harm the vfio ecosystem overall
> > > through bad actors and reduced autonomy. Thanks,
> > >
> > The requirement to access the PF, as mentioned above, is one of the reasons
> > for us to implement the emulation in the kernel.
> > Another reason is that we don't want to duplicate a lot of kernel logic in
> > QEMU, as was done in Alex Graf's "vfio-i40e". QEMU would then have to be
> > updated whenever the kernel driver changes. The effort for maintenance and
> > version matching is a big burden to vendors.
> > But you are right, code under vendor specific directories gets less review
> > from the virtualization side. That's also our motivation for proposing
> > common helper APIs for it to call, not only for convenience and
> > less duplication, but also so that the code is fully reviewed.
> >
> > Would you mind giving us some suggestions on where to go?
>
> Not duplicating kernel code into userspace isn't a great excuse. What
> we need to do to emulate a device is not an exact mapping to what a
> driver for that device needs to do. If we need to keep the device
> driver and the emulation in sync then we haven't done a good job with
> the emulation. What would it look like if we only had an additional
> device specific region on the vfio device fd we could use to get the
> descriptor information we need from the PF? This would be more in line
> with the quirks we provide for IGD assignment. Thanks,
>
Hi Alex,
Thanks for this suggestion.
As the migration region is a generic vendor region, do you think the
following way of specifying a device specific region is acceptable?

(1) provide/export an interface that lets a vendor driver register its
device specific region, or substitute get_region_info/rw/mmap of existing
regions.
(2) export vfio_pci_default_rw() and vfio_pci_default_mmap() so that they
can be called from both vendor driver handlers and vfio-pci.

Or do you still prefer adding quirks per device, so that you can better
review all of the code?

We can add a flag that disables the regions registered or modified by
vendor drivers in (1), for debugging purposes.
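
To make (1), (2) and the disable flag concrete, here is a minimal userspace
sketch of the dispatch I have in mind. It is only a model, not kernel code:
every demo_* name is hypothetical, demo_default_rw() merely stands in for
vfio_pci_default_rw() (whose real signature is not assumed here), and the
"region" is a plain byte array.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define DEMO_NUM_REGIONS 4
#define DEMO_REGION_SIZE 64

struct demo_device;

/* Hypothetical per-region handler a vendor driver may substitute. */
typedef ssize_t (*demo_rw_fn)(struct demo_device *dev, unsigned int index,
                              char *buf, size_t count, off_t pos,
                              bool is_write);

struct demo_region_ops {
    demo_rw_fn rw;
};

struct demo_device {
    unsigned char region[DEMO_NUM_REGIONS][DEMO_REGION_SIZE];
    const struct demo_region_ops *vendor_ops[DEMO_NUM_REGIONS];
    bool disable_vendor_regions;   /* debug flag: bypass vendor handlers */
    unsigned int trapped;          /* accesses seen by the vendor handler */
};

/* Stand-in for the exported default handler described in (2). */
ssize_t demo_default_rw(struct demo_device *dev, unsigned int index,
                        char *buf, size_t count, off_t pos, bool is_write)
{
    if (index >= DEMO_NUM_REGIONS || pos < 0 ||
        (size_t)pos > DEMO_REGION_SIZE ||
        count > DEMO_REGION_SIZE - (size_t)pos)
        return -EINVAL;

    if (is_write)
        memcpy(&dev->region[index][pos], buf, count);
    else
        memcpy(buf, &dev->region[index][pos], count);
    return (ssize_t)count;
}

/* Registration interface described in (1). */
int demo_register_region_ops(struct demo_device *dev, unsigned int index,
                             const struct demo_region_ops *ops)
{
    if (index >= DEMO_NUM_REGIONS || !ops || !ops->rw)
        return -EINVAL;
    dev->vendor_ops[index] = ops;
    return 0;
}

/* Core dispatch: vendor handler if registered and not disabled. */
ssize_t demo_region_rw(struct demo_device *dev, unsigned int index,
                       char *buf, size_t count, off_t pos, bool is_write)
{
    const struct demo_region_ops *ops;

    if (index >= DEMO_NUM_REGIONS)
        return -EINVAL;

    ops = dev->vendor_ops[index];
    if (ops && !dev->disable_vendor_regions)
        return ops->rw(dev, index, buf, count, pos, is_write);
    return demo_default_rw(dev, index, buf, count, pos, is_write);
}

/* Example vendor handler: note the access, then reuse the default path. */
ssize_t demo_vendor_rw(struct demo_device *dev, unsigned int index,
                       char *buf, size_t count, off_t pos, bool is_write)
{
    dev->trapped++;
    return demo_default_rw(dev, index, buf, count, pos, is_write);
}

const struct demo_region_ops demo_vendor_ops = { .rw = demo_vendor_rw };
```

With the flag set, every access takes the default path even though the
vendor ops are still registered, which is the behaviour I'd expect when
checking whether a problem reproduces with pure vfio-pci.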

Thanks
Yan