RE: Virtualizing MSI-X on IMS via VFIO

From: Tian, Kevin
Date: Wed Jun 23 2021 - 02:16:36 EST


> From: Jiang, Dave <dave.jiang@xxxxxxxxx>
> Sent: Tuesday, June 22, 2021 11:51 PM
>
> On 6/22/2021 3:16 AM, Tian, Kevin wrote:
> > Hi, Alex,
> >
> > Need your help to understand the current MSI-X virtualization flow in
> > VFIO. Some background info first.
> >
> > Recently we have been discussing how to virtualize MSI-X with Interrupt
> > Message Storage (IMS) on mdev:
> > https://lore.kernel.org/kvm/87im2lyiv6.ffs@xxxxxxxxxxxxxxxxxxxxxxx/
> >
> > IMS is device-specific interrupt storage that allows interrupts to be
> > generated in an optimized and scalable manner. The idxd mdev exposes a
> > virtual MSI-X capability to the guest but physically uses IMS entries to
> > generate interrupts.
> >
> > Thomas has helped implement a generic ims irqchip driver:
> > https://lore.kernel.org/linux-hyperv/20200826112335.202234502@xxxxxxxxxxxxx/
> >
> > The idxd device allows software to specify an IMS entry (for triggering
> > the completion interrupt) when submitting a descriptor. To prevent one
> > mdev from triggering a malicious interrupt into another mdev (by specifying
> > an arbitrary entry), each idxd ims entry includes a PASID field for
> > validation: only a matching PASID in the executed descriptor can trigger an
> > interrupt via this entry. The idxd driver is expected to program ims entries
> > with PASIDs that are allocated to the mdev which owns those entries.
> >
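
A minimal sketch of the PASID check described above, for readers who prefer code. The struct layout and the names ims_entry, ims_entry_program() and ims_entry_may_fire() are hypothetical and only illustrate the idea; they are not the idxd register format or driver API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one IMS entry; the real idxd layout is not shown. */
struct ims_entry {
    uint32_t pasid;   /* PASID the host driver programmed into the entry */
};

/* Host side: tag an entry with a PASID owned by the mdev that owns it. */
void ims_entry_program(struct ims_entry *e, uint32_t owner_pasid)
{
    assert(e != NULL);
    e->pasid = owner_pasid;
}

/* Device side: a descriptor may fire the entry only if its PASID matches. */
bool ims_entry_may_fire(const struct ims_entry *e, uint32_t desc_pasid)
{
    assert(e != NULL);
    return e->pasid == desc_pasid;
}
```

With this model, a descriptor submitted by another mdev carries a different PASID and therefore cannot raise an interrupt through the entry, even if it names that entry.
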
> > Other devices may have a different ID and format to isolate ims entries.
> > But we need to abstract a generic means for programming a vendor-specific
> > ID into a vendor-specific ims entry, without violating the layering model.
> >
> > Thomas suggested that the vendor driver first register ID information
> > (possibly plus the location to write the ID to) in msi_desc when allocating
> > irqs (by extending the existing alloc function or via a new helper function)
> > and then have the generic ims irqchip driver write the ID to the ims entry
> > when it's started up by request_irq().
> >
> > Then there are two questions to be answered:
> >
> > 1) How does the vendor driver decide the ID to be registered to msi_desc?
> > 2) How is Thomas's model mapped to the MSI-X virtualization flow in VFIO?
> >
> > For the 1st open question, there are two types of PASIDs on idxd mdev:
> >
> > 1) default PASID: one per mdev and allocated when mdev is created;
> > 2) sva PASIDs: multiple per mdev and allocated on-demand (via vIOMMU);
> >
> > If vIOMMU is not exposed, all ims entries of this mdev should be
> > programmed with the default PASID, which is always available during the
> > mdev's lifespan.
> >
> > If vIOMMU is exposed and guest sva is enabled, entries used for sva
> > should be tagged with sva PASIDs, leaving others tagged with the default
> > PASID. To help achieve intra-guest interrupt isolation, the guest idxd driver
> > needs to program guest sva PASIDs into the virtual MSIX_PERM register (one
> > per MSI-X entry) for validation. Access to MSIX_PERM is trap-and-emulated
> > by the host idxd driver, which then figures out which PASID to register to
> > msi_desc (this requires PASID translation info via the new /dev/iommu
> > proposal).
> >
> > The guest driver is expected to update MSIX_PERM before request_irq().
> >
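
The PASID selection rule for the 1st question could be sketched roughly as follows. Everything here is an assumption made for illustration: struct vmsix_perm is a decoded stand-in for the virtual MSIX_PERM register, struct pasid_map stands in for the guest-to-host translation info expected from the /dev/iommu proposal, and pick_host_pasid() is a hypothetical helper, not an existing kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_PASID UINT32_MAX

/* Hypothetical decoded view of one virtual MSIX_PERM register. */
struct vmsix_perm {
    bool     pasid_en;     /* guest enabled PASID validation for this vector */
    uint32_t guest_pasid;  /* guest sva PASID, meaningful only if pasid_en */
};

/* Hypothetical guest-to-host PASID translation record. */
struct pasid_map {
    uint32_t guest;
    uint32_t host;
};

/* Pick the host PASID to register in msi_desc for one vector. */
uint32_t pick_host_pasid(const struct vmsix_perm *perm, bool viommu_exposed,
                         uint32_t default_pasid,
                         const struct pasid_map *map, size_t nmap)
{
    assert(perm != NULL);

    /* No vIOMMU, or vector not used for sva: use the default PASID. */
    if (!viommu_exposed || !perm->pasid_en)
        return default_pasid;

    /* sva vector: translate the guest PASID to the host PASID. */
    for (size_t i = 0; i < nmap; i++)
        if (map[i].guest == perm->guest_pasid)
            return map[i].host;

    return INVALID_PASID;  /* no translation known for this guest PASID */
}
```

How an untranslatable guest PASID should be handled is left open here; the sketch simply reports it to the caller.
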
> > Now the 2nd open question requires your help. Below is what I learned from
> > the current vfio/qemu code (for a vfio-pci device):
> >
> > 0) Qemu doesn't attempt to allocate all irqs as reported by
> > msix->table_size. Allocation is done in a dynamic and incremental way.
> >
> > 1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
> > allocating/enabling irqs given a set of vMSIX vectors [start, count]:
> >
> > a) if irqs not allocated, allocate irqs [start+count]. Enable irqs for
> > specified vectors [start, count] via request_irq();
> > b) if irqs already allocated, enable irqs for specified vectors;
> > c) if irq already enabled, disable and re-enable irqs for specified
> > vectors because user may specify a different eventfd;
> >
> > 2) When the guest enables the virtual MSI-X capability, Qemu calls
> > VFIO_DEVICE_SET_IRQS to enable vector#0, even though it's currently
> > masked by the guest. Interrupts are received by Qemu but blocked
> > from the guest via mask/pending bit emulation. The main intention is
> > to enable physical MSI-X;
> >
> > 3) When the guest unmasks vector#0 via request_irq(), Qemu calls
> > VFIO_DEVICE_SET_IRQS to enable vector#0 again, with an eventfd different
> > from the one provided in 2);
> >
> > 4) When guest unmasks vector#1, Qemu finds it's outside of allocated
> > vectors (only vector#0 now):
> >
> > a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free
> > irq for vector#0;
> >
> > b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
> > irqs for both vector#0 and vector#1;
> >
> > 5) When guest unmasks vector#2, same flow in 4) continues.
> >
> > ....
> >
> > If the above understanding is correct, how are lost interrupts avoided between
> > 4.a) and 4.b), given that the irq has been torn down for vector#0 in the middle
> > while from the guest's p.o.v. this vector is actually unmasked? There must be
> > a mechanism in place, but I just didn't figure it out...
> >
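
To make the window between 4.a) and 4.b) concrete, here is a toy model of the sequence. It is purely illustrative: struct vec_model and the model_*() helpers are hypothetical names, and the model deliberately does not include whatever mechanism VFIO, Qemu, or the hardware may use to avoid the loss, which is exactly what the question asks about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VEC 8

/* Toy per-vector state; not a model of any real kernel structure. */
struct vec_model {
    bool     enabled[MAX_VEC];    /* irq requested and wired to an eventfd */
    unsigned delivered[MAX_VEC];  /* interrupts that reached the eventfd */
    unsigned dropped[MAX_VEC];    /* interrupts raised while torn down */
};

/* Step 4.a in the toy model: disable and free every irq. */
void model_disable_all(struct vec_model *m)
{
    for (size_t i = 0; i < MAX_VEC; i++)
        m->enabled[i] = false;
}

/* Step 4.b in the toy model: allocate and enable vectors [0, count). */
void model_enable_range(struct vec_model *m, size_t count)
{
    assert(count <= MAX_VEC);
    for (size_t i = 0; i < count; i++)
        m->enabled[i] = true;
}

/* The device raises vector v; it is lost if the irq is torn down. */
void model_raise(struct vec_model *m, size_t v)
{
    assert(v < MAX_VEC);
    if (m->enabled[v])
        m->delivered[v]++;
    else
        m->dropped[v]++;
}
```

In this model, an interrupt raised on vector#0 after model_disable_all() and before model_enable_range(&m, 2) is counted as dropped, even though the guest considers vector#0 unmasked the whole time.
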
> > Given that the above flow is robust, mapping Thomas's model to this flow is
> > straightforward. Assume the idxd mdev has two vectors: vector#0 for the
> > misc/error interrupt and vector#1 as the completion interrupt for guest
> > sva. VFIO_DEVICE_SET_IRQS is handled by the idxd mdev driver:
> >
> > 2) When the guest enables the virtual MSI-X capability, Qemu calls
> > VFIO_DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not
> > used for sva, MSIX_PERM#0 has PASID disabled. The host idxd driver
> > knows to register the default PASID to msi_desc#0 when allocating irqs.
> > Then the .startup() callback of the ims irqchip is called to program the
> > default PASID saved in msi_desc#0 to the target ims entry upon request_irq().
> >
> > 3) When the guest unmasks vector#0 via request_irq(), Qemu calls
> > VFIO_DEVICE_SET_IRQS to enable vector#0 again. Following the same logic
> > as vfio-pci, the idxd driver first disables irq#0 via free_irq() and then
> > re-enables irq#0 via request_irq(). The default PASID is still used,
> > according to msi_desc#0.
>
> Hi Kevin, slight correction here. Because vector#0 is emulated for idxd
> vdev, it has no IMS backing. So there is no msi_desc#0 for that vector.
> msi_desc#0 actually starts at vector#1 where IMS is allocated to back
> it. vector#0 does not go through request_irq(). It only has eventfd
> part. Everything you say is correct but starts at vector#1.
>

You are right. But for simplicity of illustration, let's still assume both
vector #0 and #1 are backed by ims in the following discussion, since a purely
emulated vector is outside of this context anyway. 😊

Thanks
Kevin