Re: Summary of LPC guest MSI discussion in Santa Fe

From: Don Dutile
Date: Tue Nov 08 2016 - 14:02:50 EST


On 11/08/2016 12:54 PM, Will Deacon wrote:
On Tue, Nov 08, 2016 at 03:27:23PM +0100, Auger Eric wrote:
On 08/11/2016 03:45, Will Deacon wrote:
Rather than treat these as separate problems, a better interface is to
tell userspace about a set of reserved regions, and have this include
the MSI doorbell, irrespective of whether or not it can be remapped.
Don suggested that we statically pick an address for the doorbell in a
similar way to x86, and have the kernel map it there. We could even pick
0xfee00000. If it conflicts with a reserved region on the platform (due
to (4)), then we'd obviously have to (deterministically?) allocate it
somewhere else, but probably within the bottom 4G.
This is tentatively achieved now with
[1] [RFC v2 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 - Alt II
(http://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1264506.html)
Yup, I saw that fly by. Hopefully some of the internals can be reused
with the current thinking on user ABI.

The next question is how to tell userspace about all of the reserved
regions. Initially, the idea was to extend VFIO, however Alex pointed
out a horrible scenario:

1. QEMU spawns a VM on system 0
2. VM is migrated to system 1
3. QEMU attempts to passthrough a device using PCI hotplug

In this scenario, the guest memory map is chosen at step (1), yet there
is no VFIO fd available to determine the reserved regions. Furthermore,
the reserved regions may vary between system 0 and system 1. This pretty
much rules out using VFIO to determine the reserved regions.Alex suggested
that the SMMU driver can advertise the regions via /sys/class/iommu/. This
would solve part of the problem, but migration between systems with
different memory maps can still cause problems if the reserved regions
of the new system conflict with the guest memory map chosen by QEMU.

OK so I understand we do not want anymore the VFIO chain capability API
(patch 5 of above series) but we prefer a sysfs approach instead.
Right.

I understand the sysfs approach which allows the userspace to get the
info earlier and independently on VFIO. Keeping in mind current QEMU
virt - which is not the only userspace - will not do much from this info
until we bring upheavals in virt address space management. So if I am
not wrong, at the moment the main action to be undertaken is the
rejection of the PCI hotplug in case we detect a collision.
I don't think so; it should be up to userspace to reject the hotplug.
If userspace doesn't have support for the regions, then that's fine --
you just end up in a situation where the CPU page table maps memory
somewhere that the device can't see. In other words, you'll end up with
spurious DMA failures, but that's exactly what happens with current systems
if you passthrough an overlapping region (Robin demonstrated this on Juno).

Additionally, you can imagine some future support where you can tell the
guest not to use certain regions of its memory for DMA. In this case, you
wouldn't want to refuse the hotplug in the case of overlapping regions.

Really, I think the kernel side just needs to enumerate the fixed reserved
regions, place the doorbell at a fixed address and then advertise these
via sysfs.

I can respin [1]
- studying and taking into account Robin's comments about dm_regions
similarities
- removing the VFIO capability chain and replacing this by a sysfs API
Ideally, this would be reusable between different SMMU drivers so the sysfs
entries have the same format etc.

Would that be OK?
Sounds good to me. Are you in a position to prototype something on the qemu
side once we've got kernel-side agreement?

What about Alex comments who wanted to report the usable memory ranges
instead of unusable memory ranges?

Also did you have a chance to discuss the following items:
1) the VFIO irq safety assessment
The discussion really focussed on system topology, as opposed to properties
of the doorbell. Regardless of how the device talks to the doorbell, if
the doorbell can't protect against things like MSI spoofing, then it's
unsafe. My opinion is that we shouldn't allow passthrough by default on
systems with unsafe doorbells (we could piggyback on allow_unsafe_interrupts
cmdline option to VFIO).

A first step would be making all this opt-in, and only supporting GICv3
ITS for now.
You're trying to support a config that is < GICv3 and no ITS ? ...
That would be the equiv. of x86 pre-intr-remap, and that's why allow_unsafe_interrupts
hook was created ... to enable devel/kick-the-tires.
2) the MSI reserved size computation (is an arbitrary size OK?)
If we fix the base address, we could fix a size too. However, we'd still
need to enumerate the doorbells to check that they fit in the region we
have. If not, then we can warn during boot and treat it the same way as
a resource conflict (that is, reallocate the region in some deterministic
way).

Will