RE: [RFC] /dev/ioasid uAPI proposal

From: Tian, Kevin
Date: Tue Jun 01 2021 - 04:38:18 EST


> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, May 29, 2021 3:59 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> > ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
>
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Jason, I want to confirm something here. From the earlier discussion we
were left with the impression that you want VFIO to be a pure device
driver, so that container/group are used only by legacy applications.
With this comment, are you suggesting that VFIO can still keep the
container/group concepts, and the user just stops using the vfio iommu
uAPI (e.g. VFIO_SET_IOMMU) in favor of /dev/ioasid (which has a simple
policy that an IOASID will reject a cmd if a partially-attached group
exists)?

>
>
> > Three types of IOASIDs are considered:
> >
> > gpa_ioasid[1...N]: for GPA address space
> > giova_ioasid[1...N]: for guest IOVA address space
> > gva_ioasid[1...N]: for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> > associated routing information in the attaching operation.
> >
> > For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> > INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++++++++++++++++++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through DMA mapping protocol:
> >
> > /* Bind device to IOASID fd */
> > device_fd = open("/dev/vfio/devices/dev1", mode);
> > ioasid_fd = open("/dev/ioasid", mode);
> > ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > /* Attach device to IOASID */
> > gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > at_data = { .ioasid = gpa_ioasid};
> > ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup GPA mapping */
> > dma_map = {
> > .ioasid = gpa_ioasid;
> > .iova = 0; // GPA
> > .vaddr = 0x40000000; // HVA
> > .size = 1GB;
> > };
> > ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If the guest is assigned with more than dev1, user follows above sequence
> > to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> > address space cross all assigned devices.
>
> eg
>
> device2_fd = open("/dev/vfio/devices/dev1", mode);
> ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
>
> Right?

Exactly, except for a small typo ('dev1' -> 'dev2'). :)

>
> >
> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> > through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu needs to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> > device_fd1 = open("/dev/vfio/devices/dev1", mode);
> > device_fd2 = open("/dev/vfio/devices/dev2", mode);
> > ioasid_fd = open("/dev/ioasid", mode);
> > ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> > ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > /* pre-register the virtual address range for accounting */
> > mem_info = { .vaddr = 0x40000000; .size = 1GB };
> > ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> > /* Attach dev1 and dev2 to gpa_ioasid */
> > gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > at_data = { .ioasid = gpa_ioasid};
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup GPA mapping */
> > dma_map = {
> > .ioasid = gpa_ioasid;
> > .iova = 0; // GPA
> > .vaddr = 0x40000000; // HVA
> > .size = 1GB;
> > };
> > ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > /* After boot, guest enables a GIOVA space for dev2 */
> > giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >
> > /* First detach dev2 from previous address space */
> > at_data = { .ioasid = gpa_ioasid};
> > ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> >
> > /* Then attach dev2 to the new address space */
> > at_data = { .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup a shadow DMA mapping according to vIOMMU
> > * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> > */
>
> Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
> IOMMU?

'shadow' means the merged mapping: GIOVA(0x2000) -> HVA (0x40001000)
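
To make the merge concrete, here is a rough userspace model of it. This
is only an illustration of the arithmetic, not proposed uAPI: struct
range_map, range_translate() and shadow_translate() are hypothetical
names, and each mapping is assumed to be one linear range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one linear mapping range (not part of the uAPI). */
struct range_map {
	uint64_t in_base;	/* input address, e.g. GIOVA or GPA */
	uint64_t out_base;	/* output address, e.g. GPA or HVA */
	uint64_t size;		/* length of the range in bytes */
};

/* Translate 'in' through one range; false if 'in' is outside the range. */
bool range_translate(const struct range_map *m, uint64_t in, uint64_t *out)
{
	assert(m && out);
	if (in < m->in_base || in - m->in_base >= m->size)
		return false;
	*out = m->out_base + (in - m->in_base);
	return true;
}

/*
 * Merge GIOVA->GPA (from the vIOMMU) with GPA->HVA (the gpa_ioasid
 * mapping) to get the HVA that goes into the shadow DMA mapping.
 */
bool shadow_translate(const struct range_map *giova_to_gpa,
		      const struct range_map *gpa_to_hva,
		      uint64_t giova, uint64_t *hva)
{
	uint64_t gpa;

	return range_translate(giova_to_gpa, giova, &gpa) &&
	       range_translate(gpa_to_hva, gpa, hva);
}
```

With GIOVA 0x2000 -> GPA 0x1000 (4KB) and GPA 0 -> HVA 0x40000000 (1GB),
shadow_translate() yields HVA 0x40001000 for GIOVA 0x2000, which is the
value Qemu puts into .vaddr in 5.2.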

>
> > dma_map = {
> > .ioasid = giova_ioasid;
> > .iova = 0x2000; // GIOVA
> > .vaddr = 0x40001000; // HVA
>
> eg HVA came from reading the guest's page tables and finding it wanted
> GPA 0x1000 mapped to IOVA 0x2000?

yes

>
>
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> >
> > The flow before guest boots is same as 5.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> > memory.
> >
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> >
> > /* After boots */
> > /* Make GIOVA space nested on GPA space */
> > giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev2 to the new address space (child)
> > * Note dev2 is still attached to gpa_ioasid (parent)
> > */
> > at_data = { .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> > * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> > * to form a shadow mapping.
> > */
> > dma_map = {
> > .ioasid = giova_ioasid;
> > .iova = 0x2000; // GIOVA
> > .vaddr = 0x1000; // GPA
> > .size = 4KB;
> > };
> > ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> And in this version the kernel reaches into the parent IOASID's page
> tables to translate 0x1000 to 0x40001000 to physical page? So we
> basically remove the qemu process address space entirely from this
> translation. It does seem convenient

yes.

>
> > 5.4. IOASID nesting (hardware)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with hardware-based IOASID nesting
> > available. In this mode the pgtable binding protocol is used to
> > bind the guest IOVA page table with the IOMMU:
> >
> > /* After boots */
> > /* Make GIOVA space nested on GPA space */
> > giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev2 to the new address space (child)
> > * Note dev2 is still attached to gpa_ioasid (parent)
> > */
> > at_data = { .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind guest I/O page table */
> > bind_data = {
> > .ioasid = giova_ioasid;
> > .addr = giova_pgtable;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I really think you need to use consistent language. Things that
> allocate a new IOASID should be called IOASID_ALLOC_IOASID. If multiple
> IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
> alloc/create/bind is too confusing.
>
> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boot the guest further creates a GVA address space (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, user should avoid exposing ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (being pdev or mdev), except
> > one additional step to call KVM for ENQCMD-capable mdev:
> >
> > /* After boots */
> > /* Make GVA space nested on GPA space */
> > gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev1 to the new address space and specify vPASID */
> > at_data = {
> > .ioasid = gva_ioasid;
> > .flag = IOASID_ATTACH_USER_PASID;
> > .user_pasid = gpasid1;
> > };
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> there any scenario where we want different vpasid's for the same
> IOASID? I guess it is OK like this. Hum.

Yes, it's completely sane for the guest to link an I/O page table to
different vPASIDs on dev1 and dev2. The IOMMU doesn't mandate that
multiple devices sharing an I/O page table must use the same PASID#.
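
In other words the vPASID is a property of the (device, IOASID)
attachment, not of the IOASID itself. A toy model of that bookkeeping
follows; it is a sketch only, and struct attachment, model_attach() and
model_lookup_vpasid() are made-up names, not anything in the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ATTACH 8

/* Hypothetical record of one VFIO_ATTACH_IOASID call with a user PASID. */
struct attachment {
	int device;		/* which device was attached */
	uint32_t ioasid;	/* which IOASID it was attached to */
	uint32_t user_pasid;	/* vPASID chosen by the guest for this device */
};

struct attach_table {
	struct attachment e[MAX_ATTACH];
	size_t n;
};

/* Record an attachment; returns 0 on success, -1 if the table is full. */
int model_attach(struct attach_table *t, int device, uint32_t ioasid,
		 uint32_t user_pasid)
{
	assert(t);
	if (t->n == MAX_ATTACH)
		return -1;
	t->e[t->n++] = (struct attachment){ device, ioasid, user_pasid };
	return 0;
}

/* Find the vPASID a given device uses for a given IOASID; -1 if none. */
int model_lookup_vpasid(const struct attach_table *t, int device,
			uint32_t ioasid, uint32_t *pasid)
{
	assert(t && pasid);
	for (size_t i = 0; i < t->n; i++) {
		if (t->e[i].device == device && t->e[i].ioasid == ioasid) {
			*pasid = t->e[i].user_pasid;
			return 0;
		}
	}
	return -1;
}
```

Here dev1 and dev2 can both be attached to the same gva_ioasid while
the lookup returns a different vPASID for each of them.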

>
> > /* if dev1 is ENQCMD-capable mdev, update CPU PASID
> > * translation structure through KVM
> > */
> > pa_data = {
> > .ioasid_fd = ioasid_fd;
> > .ioasid = gva_ioasid;
> > .guest_pasid = gpasid1;
> > };
> > ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> Make sense
>
> > /* Bind guest I/O page table */
> > bind_data = {
> > .ioasid = gva_ioasid;
> > .addr = gva_pgtable1;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I summarized this as open#4 in another mail for focused discussion.

>
> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> >
> > - Host IOMMU driver receives a page request with raw fault_data {rid,
> > pasid, addr};
> >
> > - Host IOMMU driver identifies the faulting I/O page table according to
> > information registered by IOASID fault handler;
> >
> > - IOASID fault handler is called with raw fault_data (rid, pasid, addr),
> > which is saved in ioasid_data->fault_data (used for response);
> >
> > - IOASID fault handler generates a user fault_data (ioasid, addr), links it
> > to the shared ring buffer and triggers eventfd to userspace;
>
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

Yes, I acknowledged this input from you and Jean about page fault and
bind_pasid_table. I summarized it as open#3 in another mail.

Thus the following is skipped...

Thanks
Kevin

>
> > - Upon receiving the event, Qemu needs to find the virtual routing information
> > (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> > multiple, pick a random one. This should be fine since the purpose is to
> > fix the I/O page table on the guest;
>
> The device label should fix this
>
> > - Qemu finds the pending fault event, converts virtual completion data
> > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> > complete the pending fault;
> >
> > - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> > ioasid_data->fault_data, and then calls iommu api to complete it with
> > {rid, pasid, response_code};
>
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
>
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > PASID table is put in the GPA space on some platform, thus must be updated
> > by the guest. It is treated as another user page table to be bound with the
> > IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every user I/O
> > page table to the kernel so the same pgtable binding protocol (bind, cache
> > invalidate and fault handling) is unified cross platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which, once
> > enabled, requires the guest to invalidate PASID cache for any change on the
> > PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> >
> > In case of missing such capability, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> >
> > /* After boots */
> > /* Make vPASID space nested on GPA space */
> > pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev1 to pasidtbl_ioasid */
> > at_data = { .ioasid = pasidtbl_ioasid};
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind PASID table */
> > bind_data = {
> > .ioasid = pasidtbl_ioasid;
> > .addr = gpa_pasid_table;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> > /* vIOMMU detects a new GVA I/O space created */
> > gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev1 to the new address space, with gpasid1 */
> > at_data = {
> > .ioasid = gva_ioasid;
> > .flag = IOASID_ATTACH_USER_PASID;
> > .user_pasid = gpasid1;
> > };
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> > * used, the kernel will not update the PASID table. Instead, just
> > * track the bound I/O page table for handling invalidation and
> > * I/O page faults.
> > */
> > bind_data = {
> > .ioasid = gva_ioasid;
> > .addr = gva_pgtable1;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I still don't quite get the benefit from doing this.
>
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
>
> Cache invalidate seems easy enough to support
>
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
>
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
>
> Jason