Re: [PATCH v6 15/20] vfio/mdev: idxd: ims domain setup for the vdcm

From: Thomas Gleixner
Date: Mon May 31 2021 - 19:55:43 EST


On Mon, May 31 2021 at 13:57, Jason Gunthorpe wrote:
> On Mon, May 31, 2021 at 04:02:02PM +0200, Thomas Gleixner wrote:
>> > I'm quite surprised that every mdev doesn't create its own ims_domain
>> > in its probe function.
>>
>> What for?
>
> IDXD wouldn't need it, but proper IMS HW with no bound of number of
> vectors can't provide a ims_info.max_slots value here.

There is no need to do so:

https://lore.kernel.org/r/20200826112335.202234502@xxxxxxxxxxxxx

which has the IMS_MSI_QUEUE variant at which you looked at and said:

"I haven't looked through everything in detail, but this does look like
it is good for the mlx5 devices."

ims_info.max_slots is a property of the IMS_MSI_ARRAY and does not make
any restrictions on other storage.

> Instead each use use site, like VFIO, would want to specify the number
> of vectors to allocate for its own usage, then parcel them out one by
> one in the normal way. Basically VFIO is emulating a normal MSI-X
> table.

Just with a size which exceeds a normal MSI-X table, but that's an
implementation detail of the underlying physical device. It does not put
any restrictions on mdev at all.

>> > This places a global total limit on the # of vectors which makes me
>> > ask what was the point of using IMS in the first place ?
>>
>> That depends on how IMS is implemented. The IDXD variant has a fixed
>> sized message store which is shared between all subdevices, so yet
>> another domain would not provide any value.
>
> Right, IDXD would have been perfectly happy to use the normal MSI-X
> table from what I can see.

Again. No, it's a storage size problem and regular MSI-X does not
support auxiliary data.

>> For the case where the IMS store is seperate, you still have one central
>> irqdomain per physical device. The domain allocation function can then
>> create storage on demand or reuse existing storage and just fill in the
>> pointers.
>
> I think it is philosophically backwards, and it is in part what is
> motivating pretending this weird auxdomain and PASID stuff is generic.

That's a different story and as I explained to Dave already hacking all
this into mdev is backwards, but that does not make your idea of a
irqdomain per mdev any more reasonable.

The mdev does not do anything irq chip/domain related. It uses what the
underlying physical device provides. If you think otherwise then please
provide me the hierarchical model which I explained here:

https://lore.kernel.org/r/87h7tcgbs2.fsf@xxxxxxxxxxxxxxxxxxxxxxx
https://lore.kernel.org/r/87bljg7u4f.fsf@xxxxxxxxxxxxxxxxxxxxxxx

> The VFIO model is the IRQ table is associated with a VM. When the
> vfio_device is created it decides how big the MSI-X table will be and
> it needs to allocate a block of interrupts to emulate it. For security
> those interrupts need to be linked in the HW to the vfio_device and
> the VM. ie VM A cannot trigger an interrupt that would deliver to VM
> B.

Fine.

> IDXD choose to use the PASID, but other HW might use a generic VM_ID.

So what?

> Further, IDXD choose to use a VM_ID per IMS entry, but other HW is
> likely to use a VM_ID per block of IMS entries. Ie the HW tree starts
> a VM object, then locates the IMS table for that object, then triggers
> the interrupt.

If you read my other reply to Dave carefuly then you might have noticed
that this is crap and irrelevant because the ID (whatever it is) is per
device and that ID has to be stored in the device. Whether the actual
irq chip/domain driver implementation uses it per associated irq or not
does not matter at all.

> If we think about the later sort of HW I don't think the whole aux
> data and domain per pci function makes alot of sense.

First of all that is already debunked and will go nowhere and second
there is no requirement to implement this for some other incarnation of
IMS when done correctly. That whole irq_set_auxdata() stuff is not going
anywhere simply because it's not needed at all.

All what's needed is a function to store some sort of ID per device
(mdev) and the underlying IMS driver takes care of what to do with it.

That has to happen before the interrupts are allocated and if that info
is invalid then the allocation function can reject it.

> You'd want a domain per VM_ID and all the IMS entires in that domain
> share the same VM_ID. In this regard the irq domain will correspond to
> the security boundary.

The real problems are:

- Intel misled me with the requirement to set PASID after the fact
which is simply wrong and what caused me to come up with that
irq_set_auxdata() workaround.

- Their absolute ignorance for proper layering led to adding all that
irq_set_auxdata() muck to this mdev library.

Ergo, the proper thing to do is to fix this ID storage problem (PASID,
VM_ID or whatever) at the proper place, i.e. store it in struct device
(which is associated to that mdev) and let the individual drivers handle
it as they require.

It's that simple and this needs to be fixed and not some made up
philosophical question about irqdomains per mdev. Those would be even
worse than what Intel did here.

Thanks,

tglx