Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

From: Raj, Ashok
Date: Sun Nov 08 2020 - 18:58:59 EST


Hi Thomas,

[-] Jing, she isn't working at Intel anymore.

Now this is getting compiled as a book :-).. Thanks a ton!

One question on the hypercall case that isn't immediately
clear to me.

On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
>
>
> Now if we look at the virtualization scenario and device hand through
> then the structure in the guest view is not any different from the basic
> case. This works with PCI-MSI[X] and the IDXD IMS variant because the
> hypervisor can trap the access to the storage and translate the message:
>
>                    |
>                    |
>   [CPU]    -- [Bri | dge] -- Bus -- [Device]
>                    |
>   Alloc +
>   Compose                    Store     Use
>                    |
>                    | Trap
>                    v
>     Hypervisor translates and stores
>

In the above case, the VMM is responsible for writing to the message
store. In both cases, whether it is IMS or legacy MSI/MSI-X, the VMM handles
the writes to the device interrupt region and to the IRTE tables.
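
To make sure we mean the same thing by trap-and-translate, here is a rough
userspace sketch. Everything in it is hypothetical (the struct names,
HOST_MSG_BASE_SKETCH, the slot-as-handle encoding); it is not the VT-d
message format or any existing VMM code, just the shape of the flow:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are made up for illustration. */
#define NR_SLOTS_SKETCH      8
#define HOST_MSG_BASE_SKETCH UINT64_C(0xFEE00000)

struct msg_sketch {
	uint64_t addr;
	uint32_t data;
};

struct vmm_sketch {
	/* What the guest believes it programmed. */
	struct msg_sketch guest_view[NR_SLOTS_SKETCH];
	/* What the VMM actually wrote into the device message store. */
	struct msg_sketch dev_store[NR_SLOTS_SKETCH];
	/* Stand-in for "an IRTE has been set up for this slot". */
	int irte_valid[NR_SLOTS_SKETCH];
};

/*
 * Called when a guest write to the message store traps.
 * The guest message is only recorded; the device gets a host-composed
 * message that refers to the remap entry (here simply the slot number).
 * Returns 0 on success, -1 for an out-of-range slot.
 */
int vmm_trap_msg_write(struct vmm_sketch *vmm, size_t slot,
		       struct msg_sketch guest_msg)
{
	assert(vmm != NULL);

	if (slot >= NR_SLOTS_SKETCH)
		return -1;

	vmm->guest_view[slot] = guest_msg;
	vmm->irte_valid[slot] = 1;

	vmm->dev_store[slot].addr = HOST_MSG_BASE_SKETCH;
	vmm->dev_store[slot].data = (uint32_t)slot;
	return 0;
}
```

The point is only that the guest-supplied address/data never reach the
device unmodified; that property is what is lost once the store lives in
guest memory.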

> But obviously with an IMS storage location which is software controlled
> by the guest side driver (the case Jason is interested in) the above
> cannot work for obvious reasons.
>
> That means the guest needs a way to ask the hypervisor for a proper
> translation, i.e. a hypercall. Now where to do that? Looking at the
> above remapping case it's pretty obvious:
>
>
>                   |
>                   |
>   [CPU]    -- [VI | RT] -- [Bridge] -- Bus -- [Device]
>                   |
>   Alloc       "Compose"               Store     Use
>
>   Vectordomain HCALLdomain          Busdomain
>                   |    ^
>                   |    |
>                   v    |
>                Hypervisor
>              Alloc + Compose
>
> Why? Because it reflects the boundaries and leaves the busdomain part
> agnostic as it should be. And it works for _all_ variants of Busdomains.
>
> Now the question which I can't answer is whether this can work correctly
> in terms of isolation. If the IMS storage is in guest memory (queue
> storage) then the guest driver can obviously write random crap into it
> which the device will happily send. (For MSI and IDXD style IMS it
> still can trap the store).

The isolation problem is not just the guest memory being used as the interrupt
store, right? If the store to the device region is not trapped and controlled by
the VMM, there is no guarantee the guest OS has done the right thing?


Thinking about it, guest memory might be more problematic since it's not
trappable and the VMM can't enforce what is written. This is something that
needs more attention. But for now, for devices that keep the store in on-device
memory, the trap and store by the VMM seems to satisfy the security properties
you highlight here.

>
> Is the IOMMU/Interrupt remapping unit able to catch such messages which
> go outside the space to which the guest is allowed to signal to? If yes,
> problem solved. If no, then IMS storage in guest memory can't ever work.

This can probably work for SRIOV devices where the guest owns the entire device.
Interrupt remapping does have RID checks for the case where an interrupt arrives
at an interrupt handle not allocated for that BDF.

But for SIOV devices there is no PASID filtering at the remap level since
interrupt messages don't carry PASID in the TLP.
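
A toy model of the difference, in case it helps. This is a deliberately
simplified sketch: irte_sketch and irq_remap_allows() are made-up names and
the layout has nothing to do with the real IRTE format. It only models "the
remap entry remembers which RID owns it":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified remap entry: just an owner RID. */
struct irte_sketch {
	bool     present;
	uint16_t owner_rid;	/* bus/dev/fn of the allowed requester */
};

/*
 * Returns true if an interrupt message from req_rid may use the given
 * handle. Note there is no PASID argument: the message does not carry
 * one, so nothing finer than the RID can be checked here.
 */
bool irq_remap_allows(const struct irte_sketch *table, size_t nr,
		      size_t handle, uint16_t req_rid)
{
	assert(table != NULL);

	if (handle >= nr || !table[handle].present)
		return false;

	return table[handle].owner_rid == req_rid;
}
```

With two VFs (different RIDs) a message from VF A aimed at VF B's handle
fails the comparison. With two SIOV subdevices behind one RID both handles
carry the same owner_rid, so the same check passes for either of them.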

>
> Coming back to this:
>
> > In the end pci_subdevice_msi_create_irq_domain() is a platform
> > function. Either it should work completely on every device with no
> > device-specific emulation required in the VMM, or it should not work
> > at all and return -EOPNOTSUPP.
>
> The subdevice domain is a 'Busdomain' according to the structure
> above. It does not and should never have any clue about the underlying
> system. It's in the agnostic part and always works. It simply does not
> care what's underneath. So it won't return -EOPNOTSUPP.
>
> What it has to do is to transport the IMS in queue memory requirement to
> the underlying parent domain.
>
> So in case that the HCALL domain is missing, the Vector domain needs
> return an error code on domain creation. If the HCALL domain is there
> then the domain creation works and in case of actual interrupt
> allocation the hypercall either returns a valid composed message or an
> appropriate error code.
>
> But there's a catch:
>
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be

Precisely! It might work if the OS is new, but trap-emulate seems both
safe and workable for legacy guests as well?
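
For reference, the detection a new guest would rely on is roughly the
following. The CPUID.1:ECX bit 31 "hypervisor present" bit and the
__get_cpuid() helper from the GCC/Clang <cpuid.h> header are real; the
function names and the path selection are only my illustration, not a
proposal for the actual interface:

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID leaf 1, ECX bit 31: set when running under a hypervisor. */
#define CPUID_1_ECX_HYPERVISOR (UINT32_C(1) << 31)

/* Hypothetical names for the two ways the message could get composed. */
enum ims_path_sketch {
	IMS_PATH_TRAP_EMULATE,	/* legacy: VMM traps the store */
	IMS_PATH_HYPERCALL,	/* new guest asks the hypervisor */
};

int ecx_has_hypervisor_bit(uint32_t ecx)
{
	return (ecx & CPUID_1_ECX_HYPERVISOR) != 0;
}

/* Returns 1 if virtualized, 0 if not, -1 if it cannot be determined. */
int running_under_hypervisor(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return -1;
	return ecx_has_hypervisor_bit(ecx);
#else
	return -1;
#endif
}

/* Only a guest that positively knows it is in a VM can use the hcall. */
enum ims_path_sketch pick_ims_path(int hv_state)
{
	return hv_state == 1 ? IMS_PATH_HYPERCALL : IMS_PATH_TRAP_EMULATE;
}
```

A guest that can't tell (or a legacy OS that never looks) ends up on the
trap-emulate path, which is the case I am asking about above.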


Cheers,
Ashok