Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
From: Jason Gunthorpe
Date: Sun Nov 08 2020 - 18:41:55 EST
On Sun, Nov 08, 2020 at 10:11:24AM -0800, Raj, Ashok wrote:
> > On (kvm) virtualization the addr/data pair the IRQ domain hands out
> > doesn't work. It is some fake thing.
>
> Is it really some fake thing? I thought the vCPU and vector are real
> for a guest, and VMM ensures when interrupts are delivered they are either.
It is fake in the sense that it is not programmed into any hardware.
It is real in the sense that it is an ABI contract with the VMM.
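For concreteness, something along these lines is what the guest side
composes (x86 encoding, the names are mine); the VMM never programs it
into anything, it only matches the pair against its own routing state
when it injects the interrupt:

/* Rough sketch of the addr/data pair a guest composes.  Field layout
 * follows the usual x86 0xFEE... MSI encoding; the names are made up.
 * Nothing here reaches real interrupt hardware - the VMM treats the
 * pair as a token identifying which guest vCPU/vector to inject.
 */
#include <stdint.h>

#define MSI_ADDR_BASE		0xFEE00000u
#define MSI_ADDR_DEST(id)	(((uint32_t)(id) & 0xFF) << 12)

#define MSI_DATA_VECTOR(v)	((uint32_t)(v) & 0xFF)
#define MSI_DATA_DELIVERY_FIXED	(0u << 8)

struct msi_msg_pair {
	uint64_t addr;
	uint32_t data;
};

static struct msi_msg_pair compose_guest_msi(uint8_t dest_apic_id, uint8_t vector)
{
	struct msi_msg_pair m = {
		.addr = MSI_ADDR_BASE | MSI_ADDR_DEST(dest_apic_id),
		.data = MSI_DATA_DELIVERY_FIXED | MSI_DATA_VECTOR(vector),
	};
	return m;
}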
> > On something like IDXD this emulation is not so hard, on something
> > like mlx5 this is completely unworkable. Further we never do
> > emulation on our devices, they always pass native hardware through,
> > even for SIOV-like cases.
>
> So is that true for interrupts too?
There is no *mlx5* emulation. We ride on the generic MSI emulation KVM
is doing.
> Possibly you have the interrupt entries sitting in memory resident
> on the device?
For SRIOV, yes. The appeal of IMS is to move away from that.
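Purely as a made-up illustration of the "device defined storage" idea:
with IMS the message can live in something like a queue or context
structure instead of an MSI-X table in a BAR. Layout and names below
are invented, not any real device's format:

/* Hypothetical sketch only: an IMS entry embedded in a device context
 * structure, which could sit in host memory rather than in a BAR.
 */
#include <stdint.h>

struct hypothetical_queue_ctx {
	uint64_t ring_base;
	uint32_t ring_size;
	/* IMS entry: the device fetches addr/data from here when it
	 * needs to signal completion for this queue. */
	uint64_t ims_addr;
	uint32_t ims_data;
	uint32_t ims_flags;	/* e.g. a masked bit */
};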
> Don't we need the VMM to ensure they are brokered by VMM in either
> one of the two ways above?
Yes, no matter what, the VMM has to know the guest wants an interrupt
routed in and set up the VMM part of the equation. With SRIOV this is
all done with the MSI trapping.
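Self-contained sketch of the bookkeeping only, not any particular
VMM's code; real VMMs differ in detail:

/* Per trapped MSI-X entry of an assigned VF, the VMM keeps two views. */
#include <stdint.h>

struct trapped_msix_entry {
	/* What the guest wrote: its vCPU/vector encoding.  Used only to
	 * decide where to inject the interrupt into the guest. */
	uint64_t guest_addr;
	uint32_t guest_data;

	/* What actually gets programmed into the physical MSI-X table:
	 * a host-owned message, typically pointing at the IOMMU's
	 * interrupt remapping / posted-interrupt machinery. */
	uint64_t host_addr;
	uint32_t host_data;
};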
> What if the guest creates some addr in the 0xfee... range how do we
> take care of interrupt remapping and such without any VMM assist?
Not sure I understand this?
> That's true. Probably this can work the same even for MSIx types too then?
Yes, once you have the ability to hypercall to create the addr/data
pair then it can work with MSI and the VMM can stop emulation. It
would be a nice bit of uniformity to close this, but switching the VMM
from legacy to new mode is going to be tricky, I fear.
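Shape of the idea only; the interface below is entirely made up:

/* Hypothetical hypercall: instead of the guest composing 0xFEE...
 * itself and relying on the VMM trapping the table write, the guest
 * asks the VMM for an addr/data pair it can program directly into IMS
 * (or MSI) storage.
 */
#include <stdint.h>

struct msi_msg_request {
	uint32_t vcpu;		/* guest CPU the interrupt should land on */
	uint32_t vector;	/* guest vector number */
};

struct msi_msg_reply {
	uint64_t addr;		/* host-valid address to program */
	uint32_t data;		/* host-valid data to program */
};

/* hypothetical call into the VMM; returns 0 on success */
int hv_get_interrupt_msg(const struct msi_msg_request *req,
			 struct msi_msg_reply *out);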
> I agree with the overall idea and we should certainly take that into
> consideration when we need IMS in guest support and in context of
> interrupt remapping.
The issue with things, as they sit now, is SRIOV.
If any driver starts using pci_subdevice_msi_create_irq_domain() then
it fails if the VF is assigned to a guest with SRIOV. This is a real
and important use case for many devices today!
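To make the failure mode concrete (the prototype below is my guess at
this series' API and may not match it; the driver is made up):

#include <linux/pci.h>
#include <linux/msi.h>
#include <linux/irqdomain.h>

/* assumed prototype, see the series under discussion */
struct irq_domain *pci_subdevice_msi_create_irq_domain(struct pci_dev *pdev,
						struct msi_domain_info *info);

static struct msi_domain_info example_msi_info;	/* placeholder */

static int example_vf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct irq_domain *d;

	d = pci_subdevice_msi_create_irq_domain(pdev, &example_msi_info);
	if (!d)
		return -ENOMEM;

	/*
	 * Bare metal: fine.  Same VF assigned to a guest: the addr/data
	 * pairs this domain hands out are the "fake" guest values above,
	 * no VMM trap or hypercall turns them into something real, so
	 * the interrupts never arrive.
	 */
	return 0;
}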
The "solution" can't be to go back and retroactively change every
shipping device to add PCI capability blocks, and ensure that every
existing VMM strips them out before assigning the device (including
Hyper-V!!) :(
Jason