RE: Virtualizing MSI-X on IMS via VFIO

From: Tian, Kevin
Date: Wed Jun 23 2021 - 02:12:17 EST


> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Sent: Wednesday, June 23, 2021 7:59 AM
>
[...]
> I only can assume that back then the main problem was vector exhaustion
> on the host and to avoid allocating memory for interrupt descriptors
> etc, right?
>
> The host vector exhaustion problem was that each MSI-X vector consumed a
> real CPU vector, which is a limited resource on x86. This is no longer
> the case today:
>
> 1) pci_enable_msix[_range]() consumes exactly zero CPU vectors from
>    the allocatable range, independent of the number of MSI-X vectors
>    it allocates, unless it is in multi-queue managed mode, where it
>    will actually reserve a vector (maybe two) per CPU.
>
>    But for devices which are not using that mode, they just
>    opportunistically "reserve" vectors.
>
>    All entries are initialized with a special system vector which,
>    when raised, will emit a nastigram in dmesg.
>
> 2) request_irq() actually allocates a CPU vector from the
>    allocatable vector space. This can obviously still fail, which is
>    perfectly fine.
>
> So the only downside today of allocating more MSI-X vectors than
> necessary is memory consumption for the irq descriptors.

Curious about the IRTE entry when IRQ remapping is enabled: is it also
allocated at request_irq() time?
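
To make the two stages concrete, here is a minimal host-driver sketch
of the model described above (pci_alloc_irq_vectors(), pci_irq_vector()
and request_irq() are the real APIs; my_dev_setup_irqs(),
my_dev_handler and the device names are made up for illustration):

	static int my_dev_setup_irqs(struct pci_dev *pdev, void *my_dev,
				     unsigned int want)
	{
		int nvec, ret;

		/* Stage 1: allocate MSI-X descriptors. On x86 this
		 * consumes no CPU vectors from the allocatable range;
		 * all entries point at the special system vector until
		 * they are activated. */
		nvec = pci_alloc_irq_vectors(pdev, 1, want, PCI_IRQ_MSIX);
		if (nvec < 0)
			return nvec;

		/* Stage 2: only request_irq() takes a real CPU vector
		 * from the allocatable space, so this is the call that
		 * can still fail on vector exhaustion. */
		ret = request_irq(pci_irq_vector(pdev, 0), my_dev_handler,
				  0, "my_dev", my_dev);
		if (ret)
			pci_free_irq_vectors(pdev);
		return ret;
	}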

>
> Though for virtualization there is still another problem:
>
> Even if all possible MSI-X vectors for a passthrough PCI device were
> allocated upfront, independent of the actual usage in the guest,
> there would still be the issue of request_irq() failing on the host
> once the guest decides to use any of those interrupts.
>
> It's a halfway reasonable argument, by some definition of reasonable,
> that this case would be a host system configuration problem and that
> the admin who overcommitted is responsible for the consequences.
>
> Where the only reasonable consequence is to kill the guest right there
> because there is no mechanism to actually tell it that the host ran out
> of resources.
>
> Not a pretty solution at all, but in contrast to the status quo it is
> well defined. The most important aspect is that it is well defined for
> the case of success:
>
> If it succeeds, then there is no way that already requested interrupts
> can be lost or end up being redirected to the legacy PCI IRQ due to
> clearing the MSI-X enable bit, which is a gazillion times better than
> the "let's hope it works" based tinkerware we have now.

Fair enough.
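
In code terms, the only well-defined consequence would be something
like the sketch below (vfio_msihandler() is borrowed from the existing
VFIO MSI path; vfio_pci_kill_guest() is a made-up name for whatever
mechanism terminates the guest):

	/* Host side of a guest unmask, with all vectors already
	 * allocated upfront: */
	ret = request_irq(pci_irq_vector(pdev, index), vfio_msihandler,
			  0, "vfio-msix", vdev);
	if (ret)
		/* No mechanism exists to tell the guest that the host
		 * ran out of vectors, so the only consequence left is
		 * to kill the guest right there. */
		vfio_pci_kill_guest(vdev);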

>
> So, aside from the existing VFIO/PCI/MSI-X thing being just half
> thought out, even thinking about proliferating this with IMS is
> bonkers.
>
> IMS is meant to avoid the problem of MSI-X, where MSI-X has to be
> disabled in order to expand the number of vectors. The use cases are
> going to be even more dynamic than those of the usual device drivers,
> so the lost interrupt issue will be much more likely to trigger.
>
> So no, we are not going to proliferate this complete ignorance of how
> MSI-X actually works and just cram another "feature" into code which is
> known to be incorrect.
>

So the correct flow is like below:

guest::enable_msix()
    trapped_by_host()
        pci_alloc_irq_vectors(); // for all possible vMSI-X entries
        pci_enable_msix();

guest::unmask()
    trapped_by_host()
        request_irqs();

The first trap calls a new VFIO ioctl, e.g. VFIO_DEVICE_ALLOC_IRQS.

The second trap can reuse the existing VFIO_DEVICE_SET_IRQS, which just
does request_irq() if the specified irqs have been allocated.
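
In host code the two traps might then look roughly like the sketch
below (VFIO_DEVICE_ALLOC_IRQS is the new ioctl proposed above, not an
existing one; the function names and the ioctl plumbing around them
are illustrative only):

	/* Trap 1 (guest sets the MSI-X enable bit) -> proposed
	 * VFIO_DEVICE_ALLOC_IRQS: reserve descriptors for all possible
	 * vMSI-X entries. No CPU vectors are consumed here. */
	static int vfio_pci_alloc_irqs(struct vfio_pci_device *vdev,
				       unsigned int nvec)
	{
		return pci_alloc_irq_vectors(vdev->pdev, nvec, nvec,
					     PCI_IRQ_MSIX);
	}

	/* Trap 2 (guest unmasks an entry) -> existing
	 * VFIO_DEVICE_SET_IRQS path, which now only has to do the
	 * request_irq() part for an already allocated vector. */
	static int vfio_pci_start_irq(struct vfio_pci_device *vdev,
				      unsigned int index)
	{
		return request_irq(pci_irq_vector(vdev->pdev, index),
				   vfio_msihandler, 0, "vfio-msix", vdev);
	}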

Then map IMS to this flow:

guest::enable_msix()
    trapped_by_host()
        msi_domain_alloc_irqs(); // for all possible vMSI-X entries
        for_all_allocated_irqs(i)
            pci_update_msi_desc_id(i, default_pasid); // a new helper func

guest::unmask(entry#0)
    trapped_by_host()
        request_irqs();
            ims_array_irq_startup(); // write msi_desc.id (default_pasid) to ims entry

guest::set_msix_perm(entry#1, guest_sva_pasid)
    trapped_by_host()
        pci_update_msi_desc_id(1, host_sva_pasid);

guest::unmask(entry#1)
    trapped_by_host()
        request_irqs();
            ims_array_irq_startup(); // write msi_desc.id (host_sva_pasid) to ims entry
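
For illustration, pci_update_msi_desc_id() could be something along the
lines of the sketch below (this helper does not exist today, and
msi_desc carries no "id" field yet; both are assumptions of the flow
above):

	/* Hypothetical helper: stash the PASID in the MSI descriptor so
	 * that ims_array_irq_startup() can later write it into the IMS
	 * entry together with the rest of the message. */
	static void pci_update_msi_desc_id(struct pci_dev *pdev,
					   unsigned int index, u32 pasid)
	{
		struct msi_desc *desc;
		unsigned int i = 0;

		for_each_msi_entry(desc, &pdev->dev) {
			if (i++ == index) {
				desc->id = pasid; /* assumed new field */
				break;
			}
		}
	}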

Does the above match your thoughts?

Thanks
Kevin