Re: [RFC] /dev/ioasid uAPI proposal
From: Jason Gunthorpe
Date: Fri May 28 2021 - 15:58:47 EST
On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>
> 5. Use Cases and Flows
>
> Here assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> going through legacy container/group interface. For illustration purpose
> those devices are just called dev[1...N]:
>
> device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
>
> As explained earlier, one IOASID fd is sufficient for all intended use cases:
>
> ioasid_fd = open("/dev/ioasid", mode);
>
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.
For others: I don't think this is *strictly* necessary. We can
probably still get to the device_fd using the group_fd and fit that
into /dev/ioasid. It does make the rest of this more readable though.
> Three types of IOASIDs are considered:
>
> gpa_ioasid[1...N]: for GPA address space
> giova_ioasid[1...N]: for guest IOVA address space
> gva_ioasid[1...N]: for guest CPU VA address space
>
> At least one gpa_ioasid must always be created per guest, while the other
> two are relevant as far as vIOMMU is concerned.
>
> Examples here apply to both pdev and mdev, if not explicitly marked out
> (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> associated routing information in the attaching operation.
>
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
>
> 5.1. A simple example
> ++++++++++++++++++
>
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
>
> /* Bind device to IOASID fd */
> device_fd = open("/dev/vfio/devices/dev1", mode);
> ioasid_fd = open("/dev/ioasid", mode);
> ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> /* Attach device to IOASID */
> gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> If the guest is assigned with more than dev1, user follows above sequence
> to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> address space cross all assigned devices.
E.g.:
  device2_fd = open("/dev/vfio/devices/dev2", mode);
  ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
  ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
Right?
>
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
>
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid. After boot the guest creates
> an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> through mode (gpa_ioasid).
>
> Suppose IOASID nesting is not supported in this case. Qemu need to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
>
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
>
> device_fd1 = open("/dev/vfio/devices/dev1", mode);
> device_fd2 = open("/dev/vfio/devices/dev2", mode);
> ioasid_fd = open("/dev/ioasid", mode);
> ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> /* pre-register the virtual address range for accounting */
> mem_info = { .vaddr = 0x40000000; .size = 1GB };
> ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
>
> /* Attach dev1 and dev2 to gpa_ioasid */
> gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> /* After boot, guest enables an GIOVA space for dev2 */
> giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
>
> /* First detach dev2 from previous address space */
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
>
> /* Then attach dev2 to the new address space */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a shadow DMA mapping according to vIOMMU
> * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> */
Here "shadow DMA" means relaying the guest's vIOMMU page tables to the
HW IOMMU?
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x40001000; // HVA
E.g. the HVA came from reading the guest's page tables and finding it
wanted GPA 0x1000 mapped to IOVA 0x2000?
> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with software-based IOASID nesting
> available. In this mode it is the kernel instead of user to create the
> shadow mapping.
>
> The flow before guest boots is same as 5.2, except one point. Because
> giova_ioasid is nested on gpa_ioasid, locked accounting is only
> conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> memory.
>
> To save space we only list the steps after boots (i.e. both dev1/dev2
> have been attached to gpa_ioasid before guest boots):
>
> /* After boots */
> /* Make GIOVA space nested on GPA space */
> giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> * to form a shadow mapping.
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x1000; // GPA
> .size = 4KB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
And in this version the kernel reaches into the parent IOASID's page
tables to translate 0x1000 to 0x40001000 and then to a physical page?
So we basically remove the qemu process address space entirely from
this translation. It does seem convenient.
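To make sure I'm reading the nesting right, here is a minimal userspace
model of the lookup, using the numbers from 5.3. It is only a sketch:
struct map_entry, struct ioasid_model and translate_nested() are names
invented for illustration, not part of the proposed uAPI, and a real
implementation would walk IOMMU page tables rather than a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one mapping inside an IOASID (not the uAPI). */
struct map_entry {
    uint64_t iova;   /* input address of the range */
    uint64_t target; /* output address the range maps to */
    uint64_t size;   /* length of the range in bytes */
};

/* Hypothetical model of an IOASID; parent is NULL for a root IOASID. */
struct ioasid_model {
    const struct map_entry *entries;
    size_t count;
    const struct ioasid_model *parent;
};

/* Translate through a single level; returns false if nothing maps 'in'. */
static bool translate_one(const struct ioasid_model *as, uint64_t in,
                          uint64_t *out)
{
    for (size_t i = 0; i < as->count; i++) {
        const struct map_entry *e = &as->entries[i];

        if (in >= e->iova && in - e->iova < e->size) {
            *out = e->target + (in - e->iova);
            return true;
        }
    }
    return false;
}

/* Walk from the child up through every parent, feeding each level's
 * output into the next level as its input.
 */
bool translate_nested(const struct ioasid_model *as, uint64_t in,
                      uint64_t *out)
{
    uint64_t addr = in;

    assert(as != NULL && out != NULL);
    for (; as != NULL; as = as->parent) {
        if (!translate_one(as, addr, &addr))
            return false;
    }
    *out = addr;
    return true;
}
```

With a gpa level mapping [0, 1GB) to 0x40000000 and a giova child
mapping 0x2000 to 0x1000 for 4KB, translate_nested() on the child with
0x2000 yields 0x40001000, matching the shadow mapping userspace had to
build by hand in 5.2.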
> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to
> bind the guest IOVA page table with the IOMMU:
>
> /* After boots */
> /* Make GIOVA space nested on GPA space */
> giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind guest I/O page table */
> bind_data = {
> .ioasid = giova_ioasid;
> .addr = giova_pgtable;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
I really think you need to use consistent language. Things that
allocate a new IOASID should be called IOASID_ALLOC_IOASID. If multiple
IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
alloc/create/bind is too confusing.
> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
>
> After boots the guest further create a GVA address spaces (gpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
>
> As explained in section 4, user should avoid expose ENQCMD on both
> pdev and mdev.
>
> The sequence applies to all device types (being pdev or mdev), except
> one additional step to call KVM for ENQCMD-capable mdev:
>
> /* After boots */
> /* Make GVA space nested on GPA space */
> gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to the new address space and specify vPASID */
> at_data = {
> .ioasid = gva_ioasid;
> .flag = IOASID_ATTACH_USER_PASID;
> .user_pasid = gpasid1;
> };
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
Still a little unsure why the vPASID is here and not on the gva_ioasid.
Is there any scenario where we want different vPASIDs for the same
IOASID? I guess it is OK like this. Hum.
> /* if dev1 is ENQCMD-capable mdev, update CPU PASID
> * translation structure through KVM
> */
> pa_data = {
> .ioasid_fd = ioasid_fd;
> .ioasid = gva_ioasid;
> .guest_pasid = gpasid1;
> };
> ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
Makes sense.
> /* Bind guest I/O page table */
> bind_data = {
> .ioasid = gva_ioasid;
> .addr = gva_pgtable1;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
Again I do wonder if this should just be part of alloc_ioasid. Is
there any reason to split these things? The only advantage to the
split is the device is known, but the device shouldn't impact
anything..
> 5.6. I/O page fault
> +++++++++++++++
>
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
>
> - Host IOMMU driver receives a page request with raw fault_data {rid,
> pasid, addr};
>
> - Host IOMMU driver identifies the faulting I/O page table according to
> information registered by IOASID fault handler;
>
> - IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
> is saved in ioasid_data->fault_data (used for response);
>
> - IOASID fault handler generates an user fault_data (ioasid, addr), links it
> to the shared ring buffer and triggers eventfd to userspace;
Here the rid should be translated to a labeled device, returning the
device label from VFIO_BIND_IOASID_FD. Depending on how the device was
bound, the label might match a rid or a (rid, pasid) pair.
> - Upon received event, Qemu needs to find the virtual routing information
> (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> multiple, pick a random one. This should be fine since the purpose is to
> fix the I/O page table on the guest;
The device label should fix this.
> - Qemu finds the pending fault event, converts virtual completion data
> into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> complete the pending fault;
>
> - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> ioasid_data->fault_data, and then calls iommu api to complete it with
> {rid, pasid, response_code};
So resuming a fault on an ioasid will resume all devices pending on
the fault?
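As a rough illustration of the device label idea mentioned above, the
fault path could resolve the raw (rid, pasid) to a label before anything
is reported to userspace. This is a sketch under assumptions: struct
label_binding, NO_PASID and fault_to_label() are hypothetical names, and
preferring an exact (rid, pasid) match over a rid-only binding is my
guess at a sensible policy, not something the proposal specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker: the label was bound to the whole RID. */
#define NO_PASID UINT32_MAX

/* Hypothetical record of what was labeled when the device was bound. */
struct label_binding {
    uint16_t rid;   /* PCI requester ID */
    uint32_t pasid; /* NO_PASID if the label covers every PASID */
    uint64_t label; /* opaque value chosen by userspace */
};

/* Resolve a raw fault (rid, pasid) to the userspace device label.
 * An exact (rid, pasid) binding wins; otherwise fall back to a
 * rid-only binding. Returns false if nothing was bound for the fault.
 */
bool fault_to_label(const struct label_binding *tbl, size_t n,
                    uint16_t rid, uint32_t pasid, uint64_t *label)
{
    const struct label_binding *rid_only = NULL;

    assert(label != NULL);
    assert(tbl != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].rid != rid)
            continue;
        if (tbl[i].pasid == pasid) {
            *label = tbl[i].label;
            return true;
        }
        if (tbl[i].pasid == NO_PASID)
            rid_only = &tbl[i];
    }
    if (rid_only != NULL) {
        *label = rid_only->label;
        return true;
    }
    return false;
}
```

Userspace would then see (ioasid, device_label, addr) in the fault
record and would not need to pick a device at random.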
> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
>
> PASID table is put in the GPA space on some platform, thus must be updated
> by the guest. It is treated as another user page table to be bound with the
> IOMMU.
>
> As explained earlier, the user still needs to explicitly bind every user I/O
> page table to the kernel so the same pgtable binding protocol (bind, cache
> invalidate and fault handling) is unified cross platforms.
>
> vIOMMUs may include a caching mode (or paravirtualized way) which, once
> enabled, requires the guest to invalidate PASID cache for any change on the
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
>
> In case of missing such capability, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
>
> /* After boots */
> /* Make vPASID space nested on GPA space */
> pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to pasidtbl_ioasid */
> at_data = { .ioasid = pasidtbl_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind PASID table */
> bind_data = {
> .ioasid = pasidtbl_ioasid;
> .addr = gpa_pasid_table;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
>
> /* vIOMMU detects a new GVA I/O space created */
> gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to the new address space, with gpasid1 */
> at_data = {
> .ioasid = gva_ioasid;
> .flag = IOASID_ATTACH_USER_PASID;
> .user_pasid = gpasid1;
> };
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind guest I/O page table. Because SET_PASID_TABLE has been
> * used, the kernel will not update the PASID table. Instead, just
> * track the bound I/O page table for handling invalidation and
> * I/O page faults.
> */
> bind_data = {
> .ioasid = gva_ioasid;
> .addr = gva_pgtable1;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
I still don't quite get the benefit from doing this.
The idea to create an all PASID IOASID seems to work better with less
fuss on HW that is directly parsing the guest's PASID table.
Cache invalidation seems easy enough to support.
Fault handling needs to return the (ioasid, device_label, pasid) when
working with this kind of ioasid.
It is true that it does create an additional flow qemu has to
implement, but it does directly mirror the HW.
Jason