RE: [RFC] /dev/ioasid uAPI proposal

From: Tian, Kevin
Date: Wed Jun 23 2021 - 03:57:30 EST


> From: Jean-Philippe Brucker
> Sent: Saturday, June 19, 2021 1:04 AM
>
> On Thu, Jun 17, 2021 at 01:00:14PM +1000, David Gibson wrote:
> > On Thu, Jun 10, 2021 at 06:37:31PM +0200, Jean-Philippe Brucker wrote:
> > > On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> > > > For the qemu case, I would imagine a two stage fallback:
> > > >
> > > > 1) Ask for the exact IOMMU capabilities (including pagetable
> > > > format) that the vIOMMU has. If the host can supply, you're
> > > > good
> > > >
> > > > 2) If not, ask for a kernel managed IOAS. Verify that it can map
> > > > all the IOVA ranges the guest vIOMMU needs, and has an equal or
> > > > smaller pagesize than the guest vIOMMU presents. If so,
> > > > software emulate the vIOMMU by shadowing guest io pagetable
> > > > updates into the kernel managed IOAS.
> > > >
> > > > 3) You're out of luck, don't start.
> > > >
> > > > For both (1) and (2) I'd expect it to be asking this question *after*
> > > > saying what devices are attached to the IOAS, based on the virtual
> > > > hardware configuration. That doesn't cover hotplug, of course, for
> > > > that you have to just fail the hotplug if the new device isn't
> > > > supportable with the IOAS you already have.
> > >
> > > Yes. So there is a point in time when the IOAS is frozen, and cannot take
> > > in new incompatible devices. I think that can support the usage I had in
> > > mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD per
> > > feature set it could bind the first device, freeze the features, then bind
> >
> > Are you thinking of this "freeze the features" as an explicitly
> > triggered action? I have suggested that an explicit "ENABLE" step
> > might be useful, but that hasn't had much traction from what I've
> > seen.
>
> Seems like we do need an explicit enable step for the flow you described
> above:
>
> a) Bind all devices to an ioasid. Each bind succeeds.

Let's use consistent terms in this discussion. :)

'bind' the device to an IOMMU fd (a container of I/O address spaces).

'attach' the device to an IOASID (representing an I/O address space
within the IOMMU fd).

> b) Ask for a specific set of features for this aggregate of device. Ask
> for (1), fall back to (2), or abort.
> c) Boot the VM
> d) Hotplug a device, bind it to the ioasid. We're long past negotiating
> features for the ioasid, so the host needs to reject the bind if the
> new device is incompatible with what was requested at (b)
>
> So a successful request at (b) would be the point where we change the
> behavior of bind.

Per Jason's recommendation, v2 will move to a new model:

a) Bind all devices to an IOMMU fd:
   - The user should provide a 'device_cookie' to identify each bound
     device in subsequent uAPI calls.

b) Successful binding allows the user to check the capability/format info
   per device_cookie (GET_DEVICE_INFO), before creating any IOASID:
   - Sample capability info:
     * VFIO type1 map: supported page sizes, permitted IOVA ranges, etc.;
     * IOASID nesting: hardware nesting vs. software nesting;
     * User-managed page table: vendor-specific formats;
     * User-managed PASID table: vendor-specific formats;
     * vPASID: whether delegated to the user; if kernel-managed, whether
       per-RID or global;
     * coherency: what the kernel's default policy is, and whether the
       user is allowed to change it;
     * ...
   - Actual logistics might be finalized when the code is implemented;

c) When creating a new IOASID, the user should specify a format which
   is compatible with one or more devices which will be attached to this
   IOASID right after.

d) Attaching a device which has an incompatible format to this IOASID
   is simply rejected. Whether it's hotplugged doesn't matter.
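
To make the a)-d) flow concrete, here is a rough userspace mock. It is
only a sketch and not the proposed uAPI: every name in it (mock_bind,
mock_get_device_info, mock_ioasid_alloc, mock_attach, the FMT_* values)
is hypothetical, and a device's capability info is reduced to a single
bitmask of supported formats. The only point is that the compatibility
check lives in attach, so coldplug and hotplug take the same path.

```c
/*
 * Userspace mock of the bind/query/create/attach flow. This is NOT the
 * real /dev/ioasid uAPI; all names and types here are hypothetical.
 */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum mock_fmt { FMT_KERNEL_MANAGED, FMT_VENDOR_A, FMT_VENDOR_B };

#define MOCK_MAX_DEVS 8

/* What a GET_DEVICE_INFO-like query might report for one bound device. */
struct mock_dev_info {
	uint64_t cookie;	/* user-provided device_cookie from step a) */
	uint32_t fmt_mask;	/* bit N set: device supports enum mock_fmt N */
};

/* Stand-in for the IOMMU fd: the set of bound devices. */
struct mock_iommu_fd {
	struct mock_dev_info devs[MOCK_MAX_DEVS];
	size_t ndevs;
};

/* Stand-in for one IOASID: a format plus the cookies attached to it. */
struct mock_ioasid {
	enum mock_fmt fmt;
	uint64_t attached[MOCK_MAX_DEVS];
	size_t nattached;
};

/* a) Bind a device to the IOMMU fd under a user-chosen cookie. */
int mock_bind(struct mock_iommu_fd *fd, uint64_t cookie, uint32_t fmt_mask)
{
	size_t i;

	assert(fd != NULL);
	if (fd->ndevs == MOCK_MAX_DEVS)
		return -ENOSPC;
	for (i = 0; i < fd->ndevs; i++)
		if (fd->devs[i].cookie == cookie)
			return -EEXIST;
	fd->devs[fd->ndevs++] = (struct mock_dev_info){ cookie, fmt_mask };
	return 0;
}

/* b) Look up capability info by cookie; NULL if the device is not bound. */
const struct mock_dev_info *
mock_get_device_info(const struct mock_iommu_fd *fd, uint64_t cookie)
{
	size_t i;

	for (i = 0; i < fd->ndevs; i++)
		if (fd->devs[i].cookie == cookie)
			return &fd->devs[i];
	return NULL;
}

/* c) Create an IOASID with a format chosen by the user. */
void mock_ioasid_alloc(struct mock_ioasid *ioas, enum mock_fmt fmt)
{
	ioas->fmt = fmt;
	ioas->nattached = 0;
}

/* d) Attach: rejected if the device does not support the IOASID format. */
int mock_attach(const struct mock_iommu_fd *fd, struct mock_ioasid *ioas,
		uint64_t cookie)
{
	const struct mock_dev_info *info = mock_get_device_info(fd, cookie);

	if (!info)
		return -ENODEV;		/* never bound to this IOMMU fd */
	if (!(info->fmt_mask & (1u << ioas->fmt)))
		return -EINVAL;		/* incompatible format */
	if (ioas->nattached == MOCK_MAX_DEVS)
		return -ENOSPC;
	ioas->attached[ioas->nattached++] = cookie;
	return 0;
}
```

In this mock, a device bound with only FMT_KERNEL_MANAGED fails
mock_attach() with -EINVAL against an IOASID created with FMT_VENDOR_A,
no matter when the attach happens.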

QEMU is expected to query capability/format information for all devices
according to what a specified vIOMMU model requires. It then decides
whether to fail vIOMMU creation if they do not strictly match, or to fall
back to a hybrid model with software emulation to bridge the gap. In any
case, before a new I/O address space is created, QEMU should have a clear
picture of what format is required given a set of to-be-attached
devices, and whether multiple IOASIDs are required if these devices
have incompatible formats.
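
As an illustration of that userspace decision, the helper below sketches
one possible policy. It is not QEMU code and the names (plan_ioasids,
PLAN_NATIVE, PLAN_SHADOW) are made up for this mail. It assumes each
device's queried info has been reduced to a bitmask of supported
formats, that devices supporting the vIOMMU format share one IOASID, and
that the remaining devices share a kernel-managed IOASID with shadowing.
A real VMM may well split things differently.

```c
/*
 * Hypothetical planning helper for a VMM. Not QEMU code and not part of
 * the proposed uAPI; format numbers are arbitrary bit positions (< 32).
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum plan_choice { PLAN_NATIVE, PLAN_SHADOW };

/*
 * fmt_masks[i] is the set of formats device i reported (bit N = format N).
 * viommu_fmt is the format the vIOMMU model requires; kernel_fmt is the
 * kernel-managed format used for software shadowing.
 *
 * With strict == true, any device lacking viommu_fmt fails the plan.
 * Otherwise such a device falls back to shadowing if it supports
 * kernel_fmt.
 *
 * Returns the number of IOASIDs the plan needs (0 when ndevs is 0),
 * or -1 if vIOMMU creation should fail. choice[i] is filled per device
 * and is only meaningful when the return value is not -1.
 */
int plan_ioasids(const uint32_t *fmt_masks, size_t ndevs,
		 unsigned int viommu_fmt, unsigned int kernel_fmt,
		 bool strict, enum plan_choice *choice)
{
	bool need_native = false, need_shadow = false;
	size_t i;

	assert(viommu_fmt < 32 && kernel_fmt < 32);

	for (i = 0; i < ndevs; i++) {
		if (fmt_masks[i] & (1u << viommu_fmt)) {
			choice[i] = PLAN_NATIVE;
			need_native = true;
		} else if (!strict && (fmt_masks[i] & (1u << kernel_fmt))) {
			choice[i] = PLAN_SHADOW;
			need_shadow = true;
		} else {
			return -1;	/* no usable format for this device */
		}
	}
	return (int)need_native + (int)need_shadow;
}
```

A return value of 2 corresponds to the "multiple IOASIDs" case above:
the to-be-attached devices do not share a format, so the VMM knows
before creating any I/O address space that it needs one IOASID per
format.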

With this model we don't need a separate 'enable' step.

>
> Since the kernel needs a form of feature check in any case, I still have a
> preference for aborting the bind at (a) if the device isn't exactly
> compatible with other devices already in the ioasid, because it might be
> simpler to implement in the host, but I don't feel strongly about this.

This is covered by d). Actually, with all the format information available,
QEMU should not even attempt to attach an incompatible device in the
first place, though the kernel will still do this simple check under the hood.

Thanks
Kevin