Virtualizing MSI-X on IMS via VFIO

From: Tian, Kevin
Date: Tue Jun 22 2021 - 06:16:33 EST


Hi, Alex,

Need your help to understand the current MSI-X virtualization flow in
VFIO. Some background info first.

Recently we are discussing how to virtualize MSI-X with Interrupt
Message Storage (IMS) on mdev:
https://lore.kernel.org/kvm/87im2lyiv6.ffs@xxxxxxxxxxxxxxxxxxxxxxx/

IMS is a device specific interrupt storage, allowing an optimized and
scalable manner for generating interrupts. idxd mdev exposes virtual
MSI-X capability to guest but uses IMS entries physically for generating
interrupts.

Thomas has helped implement a generic ims irqchip driver:
https://lore.kernel.org/linux-hyperv/20200826112335.202234502@xxxxxxxxxxxxx/

idxd device allows software to specify an IMS entry (for triggering
completion interrupt) when submitting a descriptor. To prevent one
mdev triggering malicious interrupt into another mdev (by specifying
an arbitrary entry), idxd ims entry includes a PASID field for validation -
only a matching PASID in the executed descriptor can trigger interrupt
via this entry. idxd driver is expected to program ims entries with
PASIDs that are allocated to the mdev which owns those entries.

Other devices may have different ID and format to isolate ims entries.
But we need abstract a generic means for programming vendor-specific
ID into vendor-specific ims entry, without violating the layering model.

Thomas suggested vendor driver to first register ID information (possibly
plus the location where to write ID to) in msi_desc when allocating irqs
(extend existing alloc function or via new helper function) and then have
the generic ims irqchip driver to update ID to the ims entry when it's
started up by request_irq().

Then there are two questions to be answered:

1) How does vendor driver decide the ID to be registered to msi_desc?
2) How is Thomas's model mapped to the MSI-X virtualization flow in VFIO?

For the 1st open, there are two types of PASIDs on idxd mdev:

1) default PASID: one per mdev and allocated when mdev is created;
2) sva PASIDs: multiple per mdev and allocated on-demand (via vIOMMU);

If vIOMMU is not exposed, all ims entries of this mdev should be
programmed with default PASID which is always available in mdev's
lifespan.

If vIOMMU is exposed and guest sva is enabled, entries used for sva
should be tagged with sva PASIDs, leaving others tagged with default
PASID. To help achieve intra-guest interrupt isolation, guest idxd driver
needs program guest sva PASIDs into virtual MSIX_PERM register (one
per MSI-X entry) for validation. Access to MSIX_PERM is trap-and-emulated
by host idxd driver which then figure out which PASID to register to
msi_desc (require PASID translation info via new /dev/iommu proposal).

The guest driver is expected to update MSIX_PERM before request_irq().

Now the 2nd open requires your help. Below is what I learned from
current vfio/qemu code (for vfio-pci device):

0) Qemu doesn't attempt to allocate all irqs as reported by msix->
table_size. It is done in an dynamic and incremental way.

1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
allocating/enabling irqs given a set of vMSIX vectors [start, count]:

a) if irqs not allocated, allocate irqs [start+count]. Enable irqs for
specified vectors [start, count] via request_irq();
b) if irqs already allocated, enable irqs for specified vectors;
c) if irq already enabled, disable and re-enable irqs for specified
vectors because user may specify a different eventfd;

2) When guest enables virtual MSI-X capability, Qemu calls VFIO_
DEVICE_SET_IRQS to enable vector#0, even though it's currently
masked by the guest. Interrupts are received by Qemu but blocked
from guest via mask/pending bit emulation. The main intention is
to enable physical MSI-X;

3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
DEVICE_SET_IRQS to enable vector#0 again, with a eventfd different
from the one provided in 2);

4) When guest unmasks vector#1, Qemu finds it's outside of allocated
vectors (only vector#0 now):

a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free
irq for vector#0;

b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
irqs for both vector#0 and vector#1;

5) When guest unmasks vector#2, same flow in 4) continues.

....

If above understanding is correct, how is lost interrupt avoided between
4.a) and 4.b) given that irq has been torn down for vector#0 in the middle
while from guest p.o.v this vector is actually unmasked? There must be
a mechanism in place, but I just didn't figure it out...

Given above flow is robust, mapping Thomas's model to this flow is
straightforward. Assume idxd mdev has two vectors: vector#0 for
misc/error interrupt and vector#1 as completion interrupt for guest
sva. VFIO_DEVICE_SET_IRQS is handled by idxd mdev driver:

2) When guest enables virtual MSI-X capability, Qemu calls VFIO_
DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not
used for sva, MSIX_PERM#0 has PASID disabled. Host idxd driver
knows to register default PASID to msi_desc#0 when allocating irqs.
Then .startup() callback of ims irqchip is called to program default
PASID saved in msi_desc#0 to the target ims entry when request_irq().

3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
DEVICE_SET_IRQS to enable vector#0 again. Following same logic
as vfio-pci, idxd driver first disable irq#0 via free_irq() and then
re-enable irq#0 via request_irq(). It's still default PASID being used
according to msi_desc#0.

4) When guest unmasks vector#1, Qemu finds it's outside of allocated
vectors (only vector#0 now):

a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free
irq for vector#0. msi_desc#0 is also freed.

b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
irqs for both vector#0 and vector#1. At this point, MSIX_PERM#0
has PASID disabled while MSIX_PERM#1 has a valid guest PASID1
for sva. idxd driver registers default PASID to msix_desc#0 and
host PASID2 (translated from guest PASID1) to msix_desc#1 when
allocating irqs. Later when both irqs are enabled via request_irq(),
ims irqchip driver updates the target ims entries according to
msix_desc#0 and misx_desc#1 respectively.

But this is specific to how Qemu virtualizes MSI-X today. What about it
may change (or another device model) to allocate all table_size irqs
when guest enables MSI-X capability? At that point we don't have valid
MSIX_PERM content to register PASID info to msix_desc. Possibly what
we really require is a separate helper function allowing driver to update
msix_desc after irq allocation, e.g. when guest unmasks a vector...

and do you see any other facets which are overlooked here?

Thanks
Kevin