RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

From: Tian, Kevin
Date: Mon Nov 09 2020 - 02:59:29 EST


> From: Raj, Ashok <ashok.raj@xxxxxxxxx>
> Sent: Monday, November 9, 2020 7:59 AM
>
> Hi Thomas,
>
> [-] Jing, she isn't working at Intel anymore.
>
> Now this is getting compiled as a book :-).. Thanks a ton!
>
> One question on the hypercall case that isn't immediately
> clear to me.
>
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> >
> > Now if we look at the virtualization scenario and device hand through
> > then the structure in the guest view is not any different from the basic
> > case. This works with PCI-MSI[X] and the IDXD IMS variant because the
> > hypervisor can trap the access to the storage and translate the message:
> >
> >                |
> >                |
> > [CPU] -- [Bri | dge] -- Bus -- [Device]
> >                |
> > Alloc +
> > Compose            Store          Use
> >                |
> >                | Trap
> >                v
> >     Hypervisor translates and stores
> >
>
> In the above case, the VMM is responsible for writing to the message
> store, whether it's IMS or legacy MSI/MSI-X. The VMM handles the
> writes to the device interrupt region and to the IRTE tables.
>
> > But obviously with an IMS storage location which is software controlled
> > by the guest side driver (the case Jason is interested in) the above
> > cannot work for obvious reasons.
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> >              |
> >              |
> > [CPU] -- [VI | RT] -- [Bridge] -- Bus -- [Device]
> >              |
> > Alloc  "Compose"         Store           Use
> >
> > Vectordomain  HCALLdomain       Busdomain
> >                |     ^
> >                |     |
> >                v     |
> >            Hypervisor
> >          Alloc + Compose
> >
> > Why? Because it reflects the boundaries and leaves the busdomain part
> > agnostic as it should be. And it works for _all_ variants of Busdomains.
> >
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
>
> The isolation problem is not just the guest memory being used as the
> interrupt store, right? If the store to the device region is not trapped
> and controlled by the VMM, there is no guarantee the guest OS has done
> the right thing?
>
>
> Thinking about it, guest memory might be more problematic since it's not
> trappable and the VMM can't enforce what is written. This is something
> that needs more attention. But for now, for devices that keep the storage
> on the device, the trap-and-store by the VMM seems to satisfy the
> security properties you highlight here.
>

Just want to clarify the trap part.

Guest memory is not trappable in Jason's example, which has the queue/IMS
storage swapped between device and memory and requires a special command
to sync the state.

But there are also other forms of in-memory IMS implementation. e.g. some
devices serve work requests based on command buffers instead of HW work
queues. The command buffers are linked in per-process contexts (both in
memory), thus similarly IMS could be stored in each context too. There is no
swap per se. The context is allocated by the driver and then registered to
the device through a management interface. When the management interface is
mediated, the hypervisor knows the IMS location and can mark it as read-only
in the EPT page table to enable trapping of guest writes. Of course this
approach is awkward if the complexity is paid just for virtualizing IMS.

Thanks
Kevin