RE: [RFC] /dev/ioasid uAPI proposal
From: Tian, Kevin
Date: Wed Jun 02 2021 - 23:05:03 EST
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, June 3, 2021 1:20 AM
>
[...]
> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations. e.g.
> >
> > 'prereg' IOAS
> > |
> > \- 'rid' IOAS
> > |
> > \- 'pasid' IOAS (maybe)
> >
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> >
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices. But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS. Using that
> > would be an alternative to attaching devices.
>
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation driven as the datastructure to store the
> preregistration data should be rather purpose built.
Yes. For now I prefer managing prereg through a separate cmd
instead of special-casing it in the IOASID graph. In any case,
preregistration is more of a per-fd thing.
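
To make the "separate cmd, per-fd" idea concrete, here is a rough userspace
model of the check that IOASID_MAP_DMA would perform once
IOASID_REGISTER_MEMORY has been called. Everything in it is hypothetical and
only for illustration: the struct names, the fixed-size array, and the errno
values are assumptions, not part of the proposal, and a real implementation
would want a purpose-built data structure as Jason notes.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-fd preregistration state; not part of the proposal. */
#define PREREG_MAX 16

struct prereg_range {
	uint64_t vaddr;
	uint64_t size;
};

struct prereg_list {
	struct prereg_range r[PREREG_MAX];
	size_t n;
	bool enabled;	/* set once a range has been registered */
};

/* Models IOASID_REGISTER_MEMORY: remember one registered vaddr range. */
static int prereg_add(struct prereg_list *l, uint64_t vaddr, uint64_t size)
{
	if (!size || vaddr + size < vaddr)	/* empty or wrapping range */
		return -EINVAL;
	if (l->n == PREREG_MAX)
		return -ENOSPC;
	l->r[l->n].vaddr = vaddr;
	l->r[l->n].size = size;
	l->n++;
	l->enabled = true;
	return 0;
}

/*
 * Models the check in IOASID_MAP_DMA: if preregistration is in use, the
 * mapped range must lie entirely inside a single registered range.
 * (A range spanning two adjacent registrations is rejected in this sketch.)
 */
static int prereg_check(const struct prereg_list *l, uint64_t vaddr,
			uint64_t size)
{
	size_t i;

	if (!l->enabled)
		return 0;	/* nothing registered: no restriction */
	if (!size || vaddr + size < vaddr)
		return -EINVAL;
	for (i = 0; i < l->n; i++) {
		const struct prereg_range *r = &l->r[i];

		if (vaddr >= r->vaddr && vaddr + size <= r->vaddr + r->size)
			return 0;
	}
	return -EFAULT;	/* assumed errno for "not preregistered" */
}
```

Since the list hangs off the fd rather than off any IOASID, every IOASID
created on that fd would go through the same check.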
>
> > > /*
> > > * Map/unmap process virtual addresses to I/O virtual addresses.
> > > *
> > > * Provide VFIO type1 equivalent semantics. Start with the same
> > > * restriction e.g. the unmap size should match those used in the
> > > * original mapping call.
> > > *
> > > * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > > * must be already in the preregistered list.
> > > *
> > > * Input parameters:
> > > * - u32 ioasid;
> > > * - refer to vfio_iommu_type1_dma_{un}map
> > > *
> > > * Return: 0 on success, -errno on failure.
> > > */
> > > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > I'm assuming these would be expected to fail if a user managed
> > pagetable has been bound?
>
> Me too, or a SVA page table.
>
> This document would do well to have a list of imagined page table
> types and the set of operations that act on them. I think they are all
> pretty disjoint..
>
> Your presentation of 'kernel owns the table' vs 'userspace owns the
> table' is a useful clarification to call out too
Sure, I incorporated this comment in my last reply.
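
As a rough sketch of what such a list could look like in code, the fragment
below gates the map/unmap commands on who owns the page table. The enum
names, the helper, and the errno values are all hypothetical and exist only
to illustrate the "kernel owns the table" vs "userspace owns the table"
split; the actual set of types and operations is still to be written down.

```c
#include <errno.h>

/* Hypothetical page table types, following the discussion above. */
enum ioas_pgtable_type {
	IOAS_PGTABLE_KERNEL,	/* kernel owns the table, filled via map/unmap */
	IOAS_PGTABLE_USER,	/* userspace (e.g. the guest) owns the table */
	IOAS_PGTABLE_SVA,	/* shares a CPU page table */
};

/* Hypothetical operation identifiers for the two commands quoted above. */
enum ioas_op {
	IOAS_OP_MAP_DMA,
	IOAS_OP_UNMAP_DMA,
};

/*
 * Return 0 if 'op' is allowed on an IOASID whose page table has the given
 * type. Map/unmap only make sense for a kernel-managed table; they fail
 * for a user-managed or SVA table. The errno values are assumptions.
 */
static int ioas_check_op(enum ioas_pgtable_type type, enum ioas_op op)
{
	switch (op) {
	case IOAS_OP_MAP_DMA:
	case IOAS_OP_UNMAP_DMA:
		return type == IOAS_PGTABLE_KERNEL ? 0 : -EINVAL;
	}
	return -ENOTTY;	/* unknown operation */
}
```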
>
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> > filenames for actual PCI functions. Maybe /dev/vfio/mdev/something
> > for mdevs. That leaves other subdirs of /dev/vfio free for future
> > non-PCI device types, and /dev/vfio itself for the legacy group
> > devices.
>
> There are a bunch of nice options here if we go this path
Yes, this part is only roughly sketched so that we can focus on /dev/iommu
first. Later versions will consider it more seriously.
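
Purely to illustrate David's suggested naming (nothing here is decided), a
userspace helper could build the per-function path like this. The helper
name is hypothetical, and the /dev/vfio/pci/ layout is only the suggestion
quoted above, not an existing interface.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format /dev/vfio/pci/DDDD:BB:SS.F for one PCI
 * function. Returns 0 on success, -1 on an out-of-range field or if the
 * buffer is too small.
 */
static int vfio_pci_dev_path(char *buf, size_t len, unsigned int domain,
			     unsigned int bus, unsigned int slot,
			     unsigned int func)
{
	int n;

	/* PCI limits: 16-bit domain, 8-bit bus, 5-bit slot, 3-bit function */
	if (domain > 0xffff || bus > 0xff || slot > 0x1f || func > 0x7)
		return -1;

	n = snprintf(buf, len, "/dev/vfio/pci/%04x:%02x:%02x.%x",
		     domain, bus, slot, func);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string would then be passed to open() in place of the
dev[1...N] placeholder used in the flows.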
>
> > > 5.2. Multiple IOASIDs (no nesting)
> > > ++++++++++++++++++++++++++++
> > >
> > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > > both devices are attached to gpa_ioasid.
> >
> > Doesn't really affect your example, but note that the PAPR IOMMU does
> > not have a passthrough mode, so devices will not initially be attached
> > to gpa_ioasid - they will be unusable for DMA until attached to a
> > gIOVA ioasid.
'Initially' here still refers to a user-requested action. For PAPR you
should attach only when it's necessary.
Thanks
Kevin