Re: Summary of LPC guest MSI discussion in Santa Fe

From: Will Deacon
Date: Tue Nov 08 2016 - 12:55:58 EST


On Tue, Nov 08, 2016 at 03:27:23PM +0100, Auger Eric wrote:
> On 08/11/2016 03:45, Will Deacon wrote:
> > Rather than treat these as separate problems, a better interface is to
> > tell userspace about a set of reserved regions, and have this include
> > the MSI doorbell, irrespective of whether or not it can be remapped.
> > Don suggested that we statically pick an address for the doorbell in a
> > similar way to x86, and have the kernel map it there. We could even pick
> > 0xfee00000. If it conflicts with a reserved region on the platform (due
> > to (4)), then we'd obviously have to (deterministically?) allocate it
> > somewhere else, but probably within the bottom 4G.
>
> This is tentatively achieved now with
> [1] [RFC v2 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 - Alt II
> (http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1264506.html)

Yup, I saw that fly by. Hopefully some of the internals can be reused
with the current thinking on user ABI.

> > The next question is how to tell userspace about all of the reserved
> > regions. Initially, the idea was to extend VFIO, however Alex pointed
> > out a horrible scenario:
> >
> > 1. QEMU spawns a VM on system 0
> > 2. VM is migrated to system 1
> > 3. QEMU attempts to passthrough a device using PCI hotplug
> >
> > In this scenario, the guest memory map is chosen at step (1), yet there
> > is no VFIO fd available to determine the reserved regions. Furthermore,
> > the reserved regions may vary between system 0 and system 1. This pretty
> > much rules out using VFIO to determine the reserved regions.Alex suggested
> > that the SMMU driver can advertise the regions via /sys/class/iommu/. This
> > would solve part of the problem, but migration between systems with
> > different memory maps can still cause problems if the reserved regions
> > of the new system conflict with the guest memory map chosen by QEMU.
>
>
> OK so I understand we do not want anymore the VFIO chain capability API
> (patch 5 of above series) but we prefer a sysfs approach instead.

Right.

> I understand the sysfs approach which allows the userspace to get the
> info earlier and independently on VFIO. Keeping in mind current QEMU
> virt - which is not the only userspace - will not do much from this info
> until we bring upheavals in virt address space management. So if I am
> not wrong, at the moment the main action to be undertaken is the
> rejection of the PCI hotplug in case we detect a collision.

I don't think so; it should be up to userspace to reject the hotplug.
If userspace doesn't have support for the regions, then that's fine --
you just end up in a situation where the CPU page table maps memory
somewhere that the device can't see. In other words, you'll end up with
spurious DMA failures, but that's exactly what happens with current systems
if you passthrough an overlapping region (Robin demonstrated this on Juno).

Additionally, you can imagine some future support where you can tell the
guest not to use certain regions of its memory for DMA. In this case, you
wouldn't want to refuse the hotplug in the case of overlapping regions.

Really, I think the kernel side just needs to enumerate the fixed reserved
regions, place the doorbell at a fixed address and then advertise these
via sysfs.

> I can respin [1]
> - studying and taking into account Robin's comments about dm_regions
> similarities
> - removing the VFIO capability chain and replacing this by a sysfs API

Ideally, this would be reusable between different SMMU drivers so the sysfs
entries have the same format etc.

> Would that be OK?

Sounds good to me. Are you in a position to prototype something on the qemu
side once we've got kernel-side agreement?

> What about Alex comments who wanted to report the usable memory ranges
> instead of unusable memory ranges?
>
> Also did you have a chance to discuss the following items:
> 1) the VFIO irq safety assessment

The discussion really focussed on system topology, as opposed to properties
of the doorbell. Regardless of how the device talks to the doorbell, if
the doorbell can't protect against things like MSI spoofing, then it's
unsafe. My opinion is that we shouldn't allow passthrough by default on
systems with unsafe doorbells (we could piggyback on allow_unsafe_interrupts
cmdline option to VFIO).

A first step would be making all this opt-in, and only supporting GICv3
ITS for now.

> 2) the MSI reserved size computation (is an arbitrary size OK?)

If we fix the base address, we could fix a size too. However, we'd still
need to enumerate the doorbells to check that they fit in the region we
have. If not, then we can warn during boot and treat it the same way as
a resource conflict (that is, reallocate the region in some deterministic
way).

Will