[RFC v2] /dev/iommu uAPI proposal

From: Tian, Kevin
Date: Fri Jul 09 2021 - 03:48:54 EST


/dev/iommu provides a unified interface for managing I/O page tables for
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
etc.) are expected to use this interface instead of creating their own logic to
isolate untrusted device DMAs initiated by userspace.

This proposal describes the uAPI of /dev/iommu and also sample sequences
with VFIO as the example in typical usages. The driver-facing kernel API
provided by the iommu layer is still TBD, which can be discussed after
consensus is reached on this uAPI.

It's based on a lengthy discussion starting from here:
https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@xxxxxxxxxx/

v1 can be found here:
https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/

This doc is also tracked on github, though it's not very useful for v1->v2
given the dramatic refactoring:
https://github.com/luxis1999/dev_iommu_uapi

Changelog (v1->v2):
- Rename /dev/ioasid to /dev/iommu (Jason);
- Add a section for device-centric vs. group-centric design (many);
- Add a section for handling no-snoop DMA (Jason/Alex/Paolo);
- Add definition of user/kernel/shared I/O page tables (Baolu/Jason);
- Allow one device bound to multiple iommu fd's (Jason);
- No need to track user I/O page tables in kernel on ARM/AMD (Jean/Jason);
- Add a device cookie for iotlb invalidation and fault handling (Jean/Jason);
- Add capability/format query interface per device cookie (Jason);
- Specify format/attribute when creating an IOASID, leading to several v1
uAPI commands removed (Jason);
- Explain the value of software nesting (Jean);
- Replace IOASID_REGISTER_VIRTUAL_MEMORY with software nesting (David/Jason);
- Cover software mdev usage (Jason);
- No restriction on map/unmap vs. bind/invalidate (Jason/David);
- Report permitted IOVA range instead of reserved range (David);
- Refine the sample structures and helper functions (Jason);
- Add definition of default and non-default I/O address spaces;
- Expand and clarify the design for PASID virtualization;
- and lots of subtle refinement according to above changes;

TOC
====
1. Terminologies and Concepts
1.1. Manage I/O address space
1.2. Attach device to I/O address space
1.3. Group isolation
1.4. PASID virtualization
1.4.1. Devices which don't support DMWr
1.4.2. Devices which support DMWr
1.4.3. Mix different types together
1.4.4. User sequence
1.5. No-snoop DMA
2. uAPI Proposal
2.1. /dev/iommu uAPI
2.2. /dev/vfio device uAPI
2.3. /dev/kvm uAPI
3. Sample Structures and Helper Functions
4. Use Cases and Flows
4.1. A simple example
4.2. Multiple IOASIDs (no nesting)
4.3. IOASID nesting (software)
4.4. IOASID nesting (hardware)
4.5. Guest SVA (vSVA)
4.6. I/O page fault
====

1. Terminologies and Concepts
-----------------------------------------

IOMMU fd is the container holding multiple I/O address spaces. The user
manages those address spaces through fd operations. Multiple fd's are
allowed per process, but with this proposal one fd should be sufficient for
all intended usages.

IOASID is the fd-local software handle representing an I/O address space.
Each IOASID is associated with a single I/O page table. IOASIDs can be
nested together, implying the output address from one I/O page table
(represented by child IOASID) must be further translated by another I/O
page table (represented by parent IOASID).

An I/O address space takes effect only after it is attached by a device.
One device is allowed to attach to multiple I/O address spaces. One I/O
address space can be attached by multiple devices.

A device must be bound to an IOMMU fd before the attach operation can be
conducted. Though not necessary, the user could bind one device to multiple
IOMMU fd's. But no cross-fd IOASID nesting is allowed.

The format of an I/O page table must be compatible with the attached
devices (or more specifically with the IOMMU which serves the DMA from
the attached devices). The user is responsible for specifying the format
when allocating an IOASID, according to one or multiple devices which
will be attached right after. Attaching a device to an IOASID with an
incompatible format is simply rejected.

Relationship between IOMMU fd, VFIO fd and KVM fd:

- IOMMU fd provides uAPI for managing IOASIDs and I/O page tables.
It also provides a unified capability/format reporting interface for
each bound device.

- VFIO fd provides uAPI for device binding and attaching. In this proposal
VFIO is used as the example of device passthrough frameworks. The
routing information that identifies an I/O address space on the wire is
per-device and registered to IOMMU fd via VFIO uAPI.

- KVM fd provides uAPI for handling no-snoop DMA and PASID virtualization
in CPU (when PASID is carried in instruction payload).

1.1. Manage I/O address space
+++++++++++++++++++++++++++++

An I/O address space can be created in three ways, according to how
the corresponding I/O page table is managed:

- kernel-managed I/O page table which is created via IOMMU fd, e.g.
for IOVA space (dpdk), GPA space (Qemu), GIOVA space (vIOMMU), etc.

- user-managed I/O page table which is created by the user, e.g. for
GIOVA/GVA space (vIOMMU), etc.

- shared kernel-managed CPU page table which is created by another
subsystem, e.g. for process VA space (mm), GPA space (kvm), etc.

The first category is managed via a dma mapping protocol (similar to
existing VFIO iommu type1), which allows the user to explicitly specify
which range in the I/O address space should be mapped.

The second category is managed via an iotlb protocol (similar to the
underlying IOMMU semantics). Once the user-managed page table is
bound to the IOMMU, the user can invoke an invalidation command
to update the kernel-side cache (either in software or in physical IOMMU).
In the meantime, a fault reporting/completion mechanism is also provided
for the user to fix up potential I/O page faults.

The last category is supposed to be managed via the subsystem which
actually owns the shared address space. Likely what's minimally required
in /dev/iommu uAPI is to build the connection with the address space
owner when allocating the IOASID, so an in-kernel interface (e.g.
mmu_notifier) is activated for any required synchronization between
IOMMU fd and the space owner.

This proposal focuses on how to manage the first two categories, as
they are existing and more urgent requirements. Support of the last
category can be discussed when a real usage comes in the future.

The user needs to specify the desired management protocol and page
table format when creating a new I/O address space. Before allocating
the IOASID, the user should already know at least one device that will be
attached to this space. It is expected to first query (via IOMMU fd) the
supported capabilities and page table format information of the to-be-
attached device (or a common set between multiple devices) and then
choose a compatible format to set on the IOASID.
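
In pseudocode, the expected sequence is roughly as below (the info and
alloc layouts are illustrative placeholders; the actual structures are
discussed in section 2.1):

    /* query capability/format info of a to-be-attached device */
    info = { .cookie = cookie };        // hypothetical layout
    ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, &info);

    /* pick a format supported by the device (or by all devices) */
    alloc_data = { .format = /* a format reported in info */ };
    ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);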

I/O address spaces can be nested together, called IOASID nesting. IOASID
nesting can be implemented in two ways: hardware nesting and software
nesting. With hardware support the child and parent I/O page tables are
walked consecutively by the IOMMU to form a nested translation. When
it's implemented in software, /dev/iommu is responsible for merging the
two-level mappings into a single-level shadow I/O page table.

A user-managed I/O page table can be set up only on a child IOASID,
implying IOASID nesting must be enabled. This is because the kernel
doesn't trust userspace. Nesting allows the kernel to enforce its DMA
isolation policy through the parent IOASID.

Software nesting is useful in several scenarios. First, it allows
centralized accounting on locked pages between multiple root IOASIDs
(no parent). In this case a 'dummy' IOASID can be created with an
identity mapping (HVA->HVA), dedicated for page pinning/accounting and
nested by all root IOASIDs. Second, it's also useful for mdev drivers
(e.g. kvmgt) to write-protect guest structures when vIOMMU is enabled.
In this case the protected addresses are in GIOVA space while KVM
write-protection API is based on GPA. Software nesting allows finding
GPA according to GIOVA in the kernel.
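
As a rough sketch of the first scenario (uAPI names from section 2; how
identity mapping is exactly expressed in the uAPI is left open):

    /* the 'dummy' root, kernel-managed, dedicated for accounting */
    alloc_data = { .user_pgtable = false };
    dummy_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

    /* identity mapping (HVA->HVA) so pages are pinned/accounted once */
    dma_map = { .ioasid = dummy_ioasid; .iova = hva; .vaddr = hva; .size = size };
    ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

    /* every 'real' root IOASID then nests on the dummy one */
    alloc_data = { .user_pgtable = false; .parent = dummy_ioasid };
    gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);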

1.2. Attach Device to I/O address space
+++++++++++++++++++++++++++++++++++++++

Device attach/bind is initiated through passthrough framework uAPI.

Device attaching is allowed only after a device is successfully bound to
the IOMMU fd. The user should provide a device cookie when binding the
device through VFIO uAPI. This cookie is used when the user queries
device capability/format, issues per-device iotlb invalidations and
receives per-device I/O page fault data via IOMMU fd.

Successful binding puts the device into a security context which isolates
its DMA from the rest of the system. VFIO should not allow the user to
access the device before binding is completed. Similarly, VFIO should
prevent the user from unbinding the device before user access is withdrawn.

When a device is in an iommu group which contains multiple devices,
all devices within the group must enter/exit the security context
together. Please check {1.3} for more info about group isolation via
this device-centric design.

Successful attaching activates an I/O address space in the IOMMU,
if the device is not purely software mediated. VFIO must provide device
specific routing information for where to install the I/O page table in
the IOMMU for this device. VFIO must also guarantee that the attached
device is configured to compose DMAs with the routing information that
is provided in the attaching call. When handling DMA requests, the IOMMU
identifies the target I/O address space according to the routing
information carried in the request. Misconfiguration breaks DMA
isolation and thus could lead to severe security vulnerabilities.

Routing information is per-device and bus specific. For PCI, it is
Requester ID (RID) identifying the device plus optional Process Address
Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
ID (SSID). PASID or SSID is used when multiple I/O address spaces are
enabled on a single device. For simplicity and continuity reasons the
following text uses RID+PASID, though SID+SSID may be a clearer naming
from the device's p.o.v. We can decide the actual naming when coding.

Because one I/O address space can be attached by multiple devices,
per-device routing information (plus device cookie) is tracked under
each IOASID and is used respectively when activating the I/O address
space in the IOMMU for each attached device.

The device in the /dev/iommu context always refers to a physical one
(pdev) which is identifiable via RID. Physically each pdev can support
one default I/O address space (routed via RID) and optionally multiple
non-default I/O address spaces (via RID+PASID).

The device in the VFIO context is a logical concept, being either a
physical device (pdev) or a mediated device (mdev or subdev). Each vfio
device is represented by RID+cookie in IOMMU fd. The user is allowed to
create one default I/O address space (routed via vRID from the user's
p.o.v) per vfio_device. VFIO decides the routing information for this
default space based on device type:

1) pdev, routed via RID;

2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
the parent's RID plus the PASID marking this mdev;

3) a purely sw-mediated device (sw mdev), no routing required i.e. no
need to install the I/O page table in the IOMMU. sw mdev just uses
the metadata to assist its internal DMA isolation logic on top of
the parent's IOMMU page table;

In addition, VFIO may allow user to create additional I/O address spaces
on a vfio_device based on the hardware capability. In such case the user
has its own view of the virtual routing information (vPASID) when marking
these non-default address spaces. How to virtualize vPASID is platform
specific and device specific. Some platforms allow the user to fully
manage the PASID space thus vPASIDs are directly used for routing and
even hidden from the kernel. Other platforms require the user to
explicitly register the vPASID information to the kernel when attaching
the vfio_device. In this case VFIO must figure out whether vPASID should
be directly used (pdev) or converted to a kernel-allocated pPASID (mdev)
for physical routing. A detailed explanation of PASID virtualization can
be found in {1.4}.

For mdev both default and non-default I/O address spaces are routed
via PASIDs. To better differentiate them we use "default PASID" (or
defPASID) when talking about the default I/O address space on mdev. When
vPASID or pPASID is referred to in PASID virtualization it's always about
the non-default spaces. defPASID and pPASID are always hidden from
userspace and can only be indirectly referenced via IOASID.

1.3. Group isolation
++++++++++++++++++++

The group is the minimal object when talking about DMA isolation in the
iommu layer. Devices which cannot be isolated from each other are
organized into a single group. Lack of isolation could be caused by
multiple reasons: no ACS capability in the upstream port, being behind a
PCIe-to-PCI bridge (thus sharing RID), DMA aliasing (multiple RIDs
per device), etc.

All devices in the group must be put in a security context together
before one or more devices in the group are operated by an untrusted
user. Passthrough frameworks must guarantee that:

1) No user access is granted on a device before a security context is
established for the entire group (becomes viable).

2) Group viability is not broken before the user relinquishes the device.
This implies that devices in the group must be either assigned to this
user, or driver-less, or bound to a driver which is known to be safe
(doesn't do DMA).

3) The security context should not be destroyed before user access
permission is withdrawn.

Existing VFIO introduces explicit container and group semantics in its
uAPI to meet the above requirements:

1) VFIO user can open a device fd only after:

* A container is created;
* The group is attached to the container (VFIO_GROUP_SET_CONTAINER);
* An empty I/O page table is created in the container (VFIO_SET_IOMMU);
* Group viability is passed and the entire group is attached to
the empty I/O page table (the security context);

2) VFIO monitors driver binding status to verify group viability

* IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
* BUG_ON() if group viability is broken;

3) Detach the group from the container when the last device fd in the
group is closed and destroy the I/O page table only after the last
group is detached from the container.
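
For reference, this sequence looks like below in today's uAPI (group
number and device name are illustrative; viability is checked by the
kernel when the group is attached):

    container_fd = open("/dev/vfio/vfio", O_RDWR);
    group_fd = open("/dev/vfio/26", O_RDWR);
    ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd);
    ioctl(container_fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
    device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");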

With this proposal VFIO can move to a simpler device-centric model by
directly exposing device nodes under "/dev/vfio/devices" w/o using the
container and group uAPI at all. In this case group isolation is enforced
implicitly within IOMMU fd:

1) A successful binding call for the first device in the group creates
the security context for the entire group, by:

* Verifying group viability in a similar way as VFIO does;

* Calling IOMMU-API to move the group into a block-dma state,
which makes all devices in the group attached to a block-dma
domain with an empty I/O page table;

VFIO should not allow the user to mmap the MMIO bar of the bound
device until the binding call succeeds.

Binding other devices in the same group just succeeds since the
security context has already been established for the entire group.

2) IOMMU fd monitors driver binding status in case group viability is
broken, same as VFIO does today. BUG_ON() might be eliminated if we
can find a way to deny probe of non-iommu-safe drivers.

Before a device is unbound from IOMMU fd, it is always attached to a
security context (either the block-dma domain or an IOASID domain).
Switching between the two domains is initiated by attaching the device
to or detaching it from an IOASID. The IOMMU layer should ensure that
the default domain is not implicitly re-attached in the switching
process, before the group is moved out of the block-dma state.

To stay on par with legacy VFIO, IOMMU fd could verify that all
bound devices in the same group are attached to a single IOASID.

3) When a device fd is closed, VFIO automatically unbinds the device from
IOMMU fd before zapping the mmio mapping. Unbinding the last device
in the group moves the entire group out of the block-dma state and
re-attaches it to the default domain.

The actual implementation may use a staging approach, e.g. only support
one-device groups at the start (leaving multi-device groups handled via
the legacy VFIO uAPI) and then cover multi-device groups in a later stage.

If necessary, devices within a group may be further allowed to be
attached to different IOASIDs in the same IOMMU fd, in case the
source devices can be reliably identified (e.g. due to !ACS). This will
require additional sub-group logic in the iommu layer and with the
sub-group topology exposed to userspace. But there is no expectation of
changing the device-centric semantics except introducing sub-group
awareness within IOMMU fd.

A more detailed explanation of the staging approach can be found:

https://lore.kernel.org/linux-iommu/BN9PR11MB543382665D34E58155A9593C8C039@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

1.4. PASID Virtualization
+++++++++++++++++++++++++

As explained in {1.2}, PASID virtualization is required when multiple I/O
address spaces are supported on a device. The actual policy is per-device
thus defined by the specific VFIO device driver.

A PASID virtualization policy is defined by four aspects:

1) Whether this device allows the user to create multiple I/O address
spaces (vPASID capability). This is decided by whether this device
and its upstream IOMMU both support PASID.

2) If yes, whether the PASID space is delegated to the user, based on
whether the PASID table should be managed by user or kernel.

3) If no, the user should register vPASID to the kernel. Then the next
question is whether vPASID should be directly used for physical routing
(vPASID==pPASID or vPASID!=pPASID). The key is whether this device
must share the PASID space with others (pdev vs. mdev).

4) If vPASID!=pPASID, whether pPASID should be allocated from the
per-RID space or a global space. This is about whether the device
supports PCIe DMWr-type work submission (e.g. Intel ENQCMD) which
requires global pPASID allocation across multiple devices.

Only vPASIDs are part of the VM state to be migrated in VM live migration.
This is basically about the virtual PASID table state in vendor vIOMMU. If
vPASID!=pPASID, new pPASIDs will be re-allocated on the destination and
VFIO device driver is responsible for programming the device to use the
new pPASID when restoring the device state.

Different policies may imply different uAPI semantics for the user to
follow when attaching a device. The semantics information is expected to
be reported to the user via VFIO uAPI instead of via IOMMU fd, since the
latter only cares about pPASID. But if there is a different thought we'd
like to hear it.

The following sections (1.4.1 - 1.4.3) provide a detailed explanation of
how the above are selected for different device types and the implication
when multiple types are mixed together (i.e. assigned to a single user).
The last section (1.4.4) then summarizes what uAPI semantics information
is reported and how the user is expected to deal with it.

1.4.1. Devices which don't support DMWr
***************************************

This section is about the following types:

1) a pdev which doesn't issue PASID;
2) a sw mdev which doesn't issue PASID;
3) a mdev which is programmed with a fixed defPASID (for the default I/O
address space), but does not expose the vPASID capability;

4) a pdev which exposes vPASID and has its PASID table managed by the user;
5) a pdev which exposes vPASID and has its PASID table managed by the kernel;
6) a mdev which exposes vPASID and shares the parent's PASID table
with other mdev's;

    +--------+---------+---------+----------+-----------+
    |        |         |Delegated| vPASID== |  per-RID  |
    |        | vPASID  | to user |  pPASID  |  pPASID   |
    +========+=========+=========+==========+===========+
    | type-1 |   N/A   |   N/A   |   N/A    |    N/A    |
    +--------+---------+---------+----------+-----------+
    | type-2 |   N/A   |   N/A   |   N/A    |    N/A    |
    +--------+---------+---------+----------+-----------+
    | type-3 |   N/A   |   N/A   |   N/A    |    N/A    |
    +--------+---------+---------+----------+-----------+
    | type-4 |   Yes   |   Yes   |  v==p(*) | per-RID(*)|
    +--------+---------+---------+----------+-----------+
    | type-5 |   Yes   |   No    |   v==p   |  per-RID  |
    +--------+---------+---------+----------+-----------+
    | type-6 |   Yes   |   No    |   v!=p   |  per-RID  |
    +--------+---------+---------+----------+-----------+
<* conceptual definition though the PASID space is fully delegated>

For 1)-3) there is no vPASID capability exposed and the user can create
only one default I/O address space on the device. Thus there is no PASID
virtualization at all.

4) is specific to ARM/AMD platforms where the PASID table is managed by
the user. In this case the entire PASID space is delegated to the user,
which just needs to create a single IOASID linked to the user-managed
PASID table, as a placeholder covering all non-default I/O address spaces
on the pdev. In concept this looks like a big 84-bit address space
(20-bit PASID + 64-bit addr). vPASID may be carried in the uAPI data to
help define the operation scope when invalidating the IOTLB or reporting
an I/O page fault. IOMMU fd doesn't touch it and just acts as a channel
for vIOMMU/pIOMMU to exchange info.

5) is specific to Intel platforms where the PASID table is managed by
the kernel. In this case vPASIDs should be registered to the kernel
in the attaching call. This implies that every non-default I/O address
space on the pdev is explicitly tracked by a unique IOASID in the kernel.
Because the pdev is fully controlled by the user, its DMA requests carry
vPASID as the routing information, thus requiring the VFIO device driver
to adopt the vPASID==pPASID policy. Because an IOASID already represents
a standalone address space, there is no need to further carry vPASID in
the invalidation and fault paths.

6) is about mdev, such as those enabled by Intel Scalable IOV. The main
difference from type-5) is whether vPASID==pPASID. There is
only a single PASID table per parent device, implying that the per-RID
PASID space is shared by all mdevs created on this parent. The VFIO
device driver must use the vPASID!=pPASID policy and allocate a pPASID
from the per-RID space for every registered vPASID to guarantee DMA
isolation between sibling mdev's. The VFIO device driver needs to conduct
the vPASID->pPASID conversion properly in several paths (see the sketch
after this list):

- When the VFIO device driver provides the routing information in the
attaching call, since IOMMU fd only cares about pPASID;
- When the VFIO device driver updates a PASID MMIO register in the
parent according to the vPASID intercepted in the mediation path;
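
Conceptually, the two paths look like below (the mdev_*_ppasid() helpers
are hypothetical; the real driver-facing API is still TBD as noted at the
top, and the attach wrapper is the one sketched in section 3):

    /* attaching path: the user registered vpasid for this I/O address space */
    ppasid = mdev_alloc_ppasid(parent_dev, vpasid);   // from the per-RID space
    iommu_pci_device_attach_pasid(idev, parent_pdev, ioasid, ppasid);

    /* mediation path: the guest writes vpasid to an intercepted register */
    ppasid = mdev_lookup_ppasid(parent_dev, vpasid);
    writel(ppasid, mdev_pasid_reg);                   // program pPASID instead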

1.4.2. Devices which support DMWr
*********************************

Modern devices may support a scalable workload submission interface
based on PCI Deferrable Memory Write (DMWr) capability, allowing a
single work queue to access multiple I/O address spaces. One example
using DMWr is Intel ENQCMD, having PASID saved in the CPU MSR and
carried in the non-posted DMWr payload when sent out to the device.
Then a single work queue shared by multiple processes can compose
DMAs toward different address spaces, by carrying the PASID value
retrieved from the DMWr payload. The role of DMWr is allowing the
shared work queue to return a retry response when the work queue
is under pressure (due to capacity or QoS). Upon such response the
software could try re-submitting the descriptor.

When ENQCMD is executed in the guest, the value saved in the CPU
MSR is vPASID (part of the xsave state). This creates another point of
consideration regarding PASID virtualization.

Two device types are relevant:

7) a pdev same as 5) plus DMWr support;
8) a mdev same as 6) plus DMWr support;

and their respective policies:

    +--------+---------+---------+----------+-----------+
    |        |         |Delegated| vPASID== |  per-RID  |
    |        | vPASID  | to user |  pPASID  |  pPASID   |
    +========+=========+=========+==========+===========+
    | type-7 |   Yes   |   Yes   |   v==p   |  per-RID  |
    +--------+---------+---------+----------+-----------+
    | type-8 |   Yes   |   Yes   |   v!=p   |  global   |
    +--------+---------+---------+----------+-----------+

DMWr or shared mode is configurable per work queue. It's completely
sane if an assigned device with multiple queues needs to handle both
DMWr (shared work queue) and normal write (dedicated work queue)
simultaneously. Thus the PASID virtualization policy must be consistent
when both paths are activated.

For 7) we should use the same policy as 5), i.e. directly using vPASID
for physical routing on the pdev. In this case ENQCMD in the guest just
works w/o additional effort because the vPASID saved in the PASID_MSR
matches the routing information configured for the target I/O address
space in the IOMMU. When receiving a DMWr request, the shared
work queue grabs vPASID from the payload and then tags outgoing
DMAs with vPASID. This is consistent with the dedicated work queue
path where vPASID is grabbed from the MMIO register to tag DMAs.

For 8) vPASID in the PASID_MSR must be converted to pPASID before
being sent on the wire (given vPASID!=pPASID, for the same reason as 6).
Intel CPU provides a hardware PASID translation capability for auto-
conversion when ENQCMD is being executed. In this case the payload
received by the work queue contains pPASID thus outgoing DMAs are
tagged with pPASID. This is consistent with the dedicated work
queue path where pPASID is programmed to the MMIO register in the
mediation path and then grabbed to tag DMAs.

However, the CPU translation structure is per-VM, which implies
that the same pPASID must be used across all type-8 devices (of this VM)
for a given vPASID. This requires the pPASID to be allocated from a
global pool by the first type-8 device and then shared by the following
type-8 devices when they are attached to the same vPASID.

CPU translation capability is enabled via KVM uAPI. We need a secure
contract between the VFIO device fd and the KVM fd so the VFIO device
driver knows when it's secure to allow guest access to the cmd portal of
the type-8 device. It's dangerous to allow the guest to issue ENQCMD to
the device before the CPU is ready for PASID translation. In this window
the vPASID is untranslated, thus allowing the guest to access a random
I/O address space on the parent of this mdev.

We plan to utilize existing kvm-vfio contract. It is currently used for
multiple purposes including propagating the kvm pointer to the VFIO
device driver. It can be extended to further notify whether CPU PASID
translation capability is turned on. Before receiving this notification,
the VFIO device driver should not allow the user to access the
DMWr-capable work queue on a type-8 device.

1.4.3. Mix different types together
***********************************

In the majority of cases mixing different types doesn't change the
aforementioned PASID virtualization policy for each type. The user just
needs to handle them on a per-device basis.

There is one exception though, when mixing type 7) and 8) together,
due to conflicting policies on how PASID_MSR should be handled.
For mdev (type-8) the CPU translation capability must be enabled to
prevent a malicious guest from doing bad things. But once per-VM
PASID translation is enabled, the shared work queue of pdev (type-7)
will also receive a pPASID allocated for mdev instead of the vPASID
that is expected on this pdev.

Fixing this exception for pdev is not easy. There are three options.

One is moving the pdev to also accept pPASID. Because the pdev may have
both a shared work queue (PASID in MSR) and a dedicated work queue
(PASID in MMIO) enabled by the guest, this requires the VFIO device
driver to mediate the dedicated work queue path so vPASIDs programmed by
the guest are manually translated to pPASIDs before being written to the
pdev. This may add undesired software complexity and potential
performance impact if the PASID register is located alongside other
fast-path resources in the same 4K page. If it works it essentially
converts type-7 to type-8 from the user's p.o.v.

The second option is using an enlightened approach so the guest
directly uses the host-allocated pPASIDs instead of creating its own
vPASID space. In this case even the dedicated work queue path uses
pPASID w/o the need for mediation. However this requires different uAPI
semantics (from register-vPASID to return-pPASID) and exposes pPASID
knowledge to userspace, which also implies breaking VM live migration.

The third option is making pPASID an alias routing info to vPASID,
with both linked to the same I/O page table in the IOMMU, so
either way can hit the desired address space. This further requires some
sort of range-split scheme to avoid conflicts between vPASID and pPASID.
However, we haven't found a clear way to fold this trick into this uAPI
proposal yet. And this option may not work when PASID is also used to
tag the IMS entry for verifying the interrupt source. In this case there
is no room for aliasing.

So, none of the above can work cleanly based on current thoughts. We
decided not to support the type-7/8 mix in this proposal. The user could
detect this exception based on the reported PASID flags, as outlined in
the next section.

1.4.4. User sequence
********************

A new PASID capability info could be introduced to VFIO_DEVICE_GET_INFO.
Its presence indicates that the user is allowed to create multiple I/O
address spaces with vPASID on the device. This capability further
includes the following flags to help describe the desired uAPI semantics:

- PASID_DELEGATED; // PASID space delegated to the user?
- PASID_CPU; // Allow vPASID used in the CPU?
- PASID_CPU_VIRT; // Require vPASID translation in the CPU?

The last two flags together help the user to detect the unsupported
type 7/8 mix scenario.

Take Qemu for example. It queries the above flags for every vfio device
at initialization time, after identifying the PASID capability:

1) If PASID_DELEGATED is set, the PASID space is fully managed by the
user thus a single IOASID (linked to user-managed page table) is
required as the placeholder for all non-default I/O address spaces
on the device.

If not set, an IOASID must be created for every non-default I/O address
space on this device and vPASID must be registered to the kernel
when attaching the device to this IOASID.

The user may want to sanity check that all devices report the same
setting, as this flag is a platform attribute though it's exported per
device.

If not set, continue to step 2.

2) If PASID_CPU is not set, done.

Otherwise check whether the PASID_CPU_VIRT flag on this device is
consistent with all other devices with PASID_CPU set.

If inconsistency is found (indicating type 7/8 mix), only one type
of devices (all set, or all clear) should have the vPASID capability
exposed to the guest.

3) If PASID_CPU_VIRT is not set, done.

If set and the consistency check in 2) passed, call the KVM uAPI to
enable CPU PASID translation if this is the first device with the flag
set. Later when a new vPASID is identified through vIOMMU at run-time,
call another KVM uAPI to update the corresponding PASID mapping.
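
Putting it together, the Qemu-side logic is roughly as below (pseudocode;
pasid_flags, consistent_cpu_virt() and first_device are illustrative
placeholders, while KVM_ENABLE_CAP is the existing KVM CAP interface):

    /* at initialization time, per vfio device exposing the PASID cap */
    ioctl(device_fd, VFIO_DEVICE_GET_INFO, &info);
    if (info.pasid_flags & PASID_DELEGATED) {
        /* 1) a single placeholder IOASID covers all non-default spaces */
    } else if (info.pasid_flags & PASID_CPU) {
        /* 2) all PASID_CPU devices must agree on PASID_CPU_VIRT */
        if (!consistent_cpu_virt(info.pasid_flags))
            /* type 7/8 mix: expose vPASID for one category only */;
        /* 3) enable CPU PASID translation on the first such device */
        else if ((info.pasid_flags & PASID_CPU_VIRT) && first_device)
            ioctl(kvm_fd, KVM_ENABLE_CAP, &cap); // KVM_CAP_PASID_TRANSLATION
    }

    /* at run-time, when the vIOMMU binds a new vPASID */
    if (cpu_pasid_translation_enabled)
        ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);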

1.5. No-snoop DMA
++++++++++++++++++++

The snoop behavior of a DMA specifies whether the access is coherent
(snoops the processor caches) or not. The snoop behavior is decided by
both the device and the IOMMU. The device can set a no-snoop attribute in
a DMA request to force non-coherent behavior, while the IOMMU may support
a configuration which enforces DMAs to be coherent (with the no-snoop
attribute ignored).

No-snoop DMA requires the driver to manually flush caches for
observing the latest content. When such a driver is running in the guest,
it further requires KVM to intercept/emulate WBINVD and favor the
guest cache attributes in the EPT page table.

Alex helped create a matrix as below:
(https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#mbfc96278b078d3ec07eabb9aa46abfe03a886dc6)

     IOMMU enforces \ Device supports
         snoop       \    no-snoop
                      \  yes  |  no   |
     -----------------+-------+-------+
            yes       |   1   |   2   |
     -----------------+-------+-------+
            no        |   3   |   4   |
     -----------------+-------+-------+

DMA is always coherent in boxes {1, 2, 4}. No-snoop DMA is allowed
in {3} but whether it is actually used is a driver decision.

VFIO currently adopts a simple policy - always turn on IOMMU enforce-
snoop if available. It provides a contract via the kvm-vfio fd for KVM to
learn whether no-snoop DMA is used, thus whether special tricks on WBINVD
and EPT must be enabled. However, the criterion for no-snoop DMA is
solely the lack of IOMMU enforce-snoop for any vfio device, i.e. both 3)
and 4) are considered capable of doing no-snoop DMA. This model has
several limitations:

- It's impossible to move a device from 1) to 3) when no-snoop DMA
is a must to achieve the desired user experience;

- Unnecessary overhead on the KVM side in 4), or if the driver doesn't do
no-snoop DMA in 3). Although the driver doesn't use WBINVD, the
guest still uses WBINVD in other places, e.g. when changing cache-
related registers (e.g. MTRR/CR0);

We want to adopt a user-driven model in /dev/iommu for more accurate
control over the no-snoop usage. In this model the enforce-snoop format
is specified when an IOASID is created, while the device's no-snoop usage
can be further clarified when it's attached to the IOASID.

IOMMU fd is expected to provide uAPIs and helper functions for:

- reporting IOMMU enforce-snoop capability to the user per device
cookie (device no-snoop capability is reported via VFIO).

- allowing the user to specify whether an IOASID should be created in the
IOMMU enforce-snoop format (enable/disable/auto), as sketched after
this list:

* This allows moving a device from 1) to 3) in case of performance
requirement.

* 'auto' falls back to the legacy VFIO policy, i.e. always enables
enforce-snoop if available.

* Any device can be attached to a non-enforce-snoop IOASID,
because this format is supported by all IOMMUs. In this case the
device belongs to {3, 4} and whether it is considered doing no-snoop
DMA is decided by the next interface.

* Attaching a device which cannot be forced to snoop by its IOMMU
to an enforce-snoop IOASID fails. Successful attaching implies the
device always does snoop DMA, i.e. it belongs to {1, 2}.

* Some platforms support page-granular enforce-snoop. One open
is whether a page-granular interface is necessary here.

- allowing the user to further hint whether no-snoop DMA is actually used
in {3, 4} on a specific IOASID, via the VFIO attaching call:

* in case the user has such intrinsic knowledge on a specific device.

* {3} can be filtered out with this hint.

* {4} can be filtered out automatically by the VFIO device driver,
based on the device's no-snoop capability.

* If no hint is provided, fall back to legacy VFIO policy, i.e.
treating all devices in {3, 4} as capable of doing no-snoop.

- a new contract for KVM to learn whether any IOASID is attached by
devices which require no-snoop DMA:

* We once thought the existing kvm-vfio fd could be leveraged as a
short-term approach (see the above link). However kvm-vfio is centered
on the vfio group concept, while this proposal is moving to a device-
centric model.

* The new contract will allow KVM to query the no-snoop requirement
per IOMMU fd. This will apply to all passthrough frameworks.

* A notification mechanism might be introduced to switch between
WBINVD emulation and no-op intercept according to device
attaching status changes in the registered IOMMU fd.

* Whether kvm-vfio will be completely deprecated is TBD. It's
still used for non-iommu related contracts, e.g. notifying the kvm
pointer to the mdev driver and pvIOMMU acceleration on PPC.

- optional bulk cache invalidation:

* A userspace driver can use clflush to invalidate cachelines for
buffers used for no-snoop DMA. But this may be inefficient when
a big buffer needs to be invalidated. In this case a bulk
invalidation could be provided based on WBINVD.
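
A rough sketch combining these interfaces (the .enforce_snoop and
.no_snoop_hint fields are hypothetical placeholders for the attributes
described above):

    /* create an IOASID w/o enforce-snoop, moving the device from {1} to {3} */
    alloc_data = { .user_pgtable = false; .enforce_snoop = false };
    gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

    /* hint that this device actually does no-snoop DMA on the IOASID */
    at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid; .no_snoop_hint = true };
    ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

    /* KVM then learns the per-fd no-snoop status via the new contract (TBD) */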

The implementation might take a staging approach. At the start IOMMU fd
only supports devices which can be forced to snoop via the IOMMU (i.e.
{1, 2}), while leaving {3, 4} still handled via legacy VFIO. In
this case there is no need to introduce a new contract with KVM. An easy
way is having VFIO not expose {3, 4} devices in /dev/vfio/devices. Then
we have plenty of time to figure out the implementation details of the
new model at a later stage.

2. uAPI Proposal
----------------------

/dev/iommu uAPI covers everything about managing I/O address spaces.

/dev/vfio device uAPI builds connection between devices and I/O address
spaces.

/dev/kvm uAPI is optionally required as far as no-snoop DMA or ENQCMD
is concerned.

2.1. /dev/iommu uAPI
++++++++++++++++++++

/*
* Check whether a uAPI extension is supported.
*
* It's unlikely that all planned capabilities in IOMMU fd will be ready in
* one breath. The user should check which uAPI extension is supported
* according to its intended usage.
*
* A rough list of possible extensions may include:
*
* - EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
* - EXT_MAP_NEWTYPE for an enhanced map semantics;
* - EXT_IOASID_NESTING for what the name stands;
* - EXT_USER_PAGE_TABLE for user managed page table;
* - EXT_USER_PASID_TABLE for user managed PASID table;
* - EXT_MULTIDEV_GROUP for 1:N iommu group;
* - EXT_DMA_NO_SNOOP for no-snoop DMA support;
* - EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
* - ...
*
* Return: 0 if not supported, 1 if supported.
*/
#define IOMMU_CHECK_EXTENSION _IO(IOMMU_TYPE, IOMMU_BASE + 0)
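
e.g. a user planning to run a vIOMMU might probe nesting support first:

    if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_IOASID_NESTING))
        /* fall back to userspace shadowing as in {4.2} */;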


/*
* Check capabilities and format information on a bound device.
*
* It could be reported either via a capability chain as implemented in
* VFIO or a per-capability query interface. The device is identified
* by device cookie (registered when binding this device).
*
* Sample capability info:
* - VFIO type1 map: supported page sizes, permitted IOVA ranges, etc.;
* - IOASID nesting: hardware nesting vs. software nesting;
* - User-managed page table: vendor specific formats;
* - User-managed pasid table: vendor specific formats;
* - coherency: whether IOMMU can enforce snoop for this device;
* - ...
*
*/
#define IOMMU_DEVICE_GET_INFO _IO(IOMMU_TYPE, IOMMU_BASE + 1)
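
For illustration only, the reported info might be shaped like below (a
purely hypothetical layout; the capability chain design is still open):

    struct iommu_device_info {
        __u64   dev_cookie;
        __u64   flags;          // e.g. enforce-snoop supported
        __u64   pgsize_bitmap;  // supported page sizes for type1 map
        __u64   pgtable_formats;// bitmap of supported vendor formats
        __u32   cap_offset;     // capability chain, as in VFIO
    };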


/*
* Allocate an IOASID.
*
* IOASID is the FD-local software handle representing an I/O address
* space. Each IOASID is associated with a single I/O page table. The
* user must call this ioctl to get an IOASID for every I/O address space
* that is intended to be tracked by the kernel.
*
* The user needs to specify the attributes of the IOASID and associated
* I/O page table format information according to one or multiple devices
* which will be attached to this IOASID right after. The I/O page table
* is activated in the IOMMU when it's attached by a device. Incompatible
* format between device and IOASID will lead to attaching failure.
*
* The root IOASID should always have a kernel-managed I/O page
* table for safety. Locked page accounting is also conducted on the root.
*
* Multiple roots are possible, e.g. when multiple I/O address spaces
* are created but IOASID nesting is disabled. However, one page might
* be accounted multiple times in this case. The user is recommended to
* instead create a 'dummy' root with identity mapping (HVA->HVA) for
* centralized accounting, nested by all other IOASIDs which represent
* 'real' I/O address spaces.
*
* Sample attributes:
* - Ownership: kernel-managed or user-managed I/O page table;
* - IOASID nesting: the parent IOASID info if enabled;
* - User-managed page table: addr and vendor specific formats;
* - User-managed pasid table: addr and vendor specific formats;
* - coherency: enforce-snoop;
* - ...
*
* Return: allocated ioasid on success, -errno on failure.
*/
#define IOMMU_IOASID_ALLOC _IO(IOMMU_TYPE, IOMMU_BASE + 2)
#define IOMMU_IOASID_FREE _IO(IOMMU_TYPE, IOMMU_BASE + 3)
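
The alloc_data initializers in section 4 assume something along these
lines (again a hypothetical layout):

    struct iommu_ioasid_alloc {
        __u32   flags;            // e.g. enforce-snoop enable/disable/auto
        __u32   parent;           // parent IOASID if nesting, 0 otherwise
        __u8    user_pgtable;     // user-managed I/O page table?
        __u8    user_pasid_table; // user-managed PASID table?
        __u64   addr;             // user page/PASID table pointer
        // format information follows ...
    };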


/*
* Map/unmap process virtual addresses to I/O virtual addresses.
*
* Provide VFIO type1 equivalent semantics. Start with the same
* restriction e.g. the unmap size should match those used in the
* original mapping call.
*
* If the specified IOASID is the root, the mapped pages are automatically
* pinned and accounted as locked memory. Pinning might be postponed
* until the IOASID is attached by a device. A software mdev driver may
* further provide a hint to skip auto-pinning at attaching time, since
* it does selective pinning at run-time. Auto-pinning can also be
* skipped when I/O page fault is enabled on the root.
*
* When software nesting is enabled, this implies that the merged
* shadow mapping will also be updated accordingly. However if the
* change happens on the parent, it requires reverse lookup to update
* all relevant child mappings, which is time consuming. So the user
* is advised not to change the parent mapping after the software
* nesting is established (maybe disallow?). There is no such restriction
* with hardware nesting, as the IOMMU will catch up with the change
* when actually walking the page table.
*
* Input parameters:
* - u32 ioasid;
* - refer to vfio_iommu_type1_dma_{un}map
*
* Return: 0 on success, -errno on failure.
*/
#define IOMMU_MAP_DMA _IO(IOMMU_TYPE, IOMMU_BASE + 4)
#define IOMMU_UNMAP_DMA _IO(IOMMU_TYPE, IOMMU_BASE + 5)


/*
* Invalidate the IOTLB for a user-managed I/O page table
*
* Check include/uapi/linux/iommu.h for supported cache types and
* granularities. Device cookie and vPASID may be specified to help
* decide the scope of this operation.
*
* Input parameters:
* - child_ioasid;
* - granularity (per-device, per-pasid, range-based);
* - cache type (iotlb, devtlb, pasid cache);
*
* Return: 0 on success, -errno on failure
*/
#define IOMMU_INVALIDATE_CACHE _IO(IOMMU_TYPE, IOMMU_BASE + 6)


/*
* Page fault report and response
*
* This is TBD. Can be added after other parts are cleared up. It may
* include a fault region to report fault data via read(), an
* eventfd to notify the user and an ioctl to complete the fault.
*
* The fault data includes {IOASID, device_cookie, faulting addr, perm}
* as common info. Vendor-specific fault info can also be included if
* necessary.
*
* If the IOASID represents a user-managed PASID table, the vendor
* fault info includes vPASID information for the user to figure out
* which I/O page table triggered the fault.
*
* If the IOASID represents a user-managed I/O page table, the user
* is expected to find out the vPASID itself according to {IOASID,
* device_cookie}.
*/


/*
* Dirty page tracking
*
* Track and report memory pages dirtied in I/O address spaces. There
* is ongoing work by Kunkun Jiang extending the existing VFIO type1
* driver. It needs to be adapted to /dev/iommu later.
*/


2.2. /dev/vfio device uAPI
++++++++++++++++++++++++++

/*
* Bind a vfio_device to the specified IOMMU fd
*
* The user should provide a device cookie when calling this ioctl. The
* cookie is later used in IOMMU fd for capability query, iotlb invalidation
* and I/O fault handling.
*
* The user is not allowed to access the device before the binding
* operation is completed.
*
* Unbind is automatically conducted when device fd is closed.
*
* Input parameters:
* - iommu_fd;
* - cookie;
*
* Return: 0 on success, -errno on failure.
*/
#define VFIO_BIND_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 22)


/*
* Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
*
* Add a new device capability. Its presence indicates that the user
* is allowed to create multiple I/O address spaces on this device. The
* capability further includes the following flags:
*
* - PASID_DELEGATED, if clear every vPASID must be registered to
* the kernel;
* - PASID_CPU, if set vPASID is allowed to be carried in the CPU
* instructions (e.g. ENQCMD);
* - PASID_CPU_VIRT, if set require vPASID translation in the CPU;
*
* The user must check that all devices with PASID_CPU set have the
* same setting on PASID_CPU_VIRT. If mismatching, it should enable
* vPASID only in one category (all set, or all clear).
*
* When the user enables vPASID on the device with PASID_CPU_VIRT
* set, it must enable vPASID CPU translation via kvm fd before attempting
* to use ENQCMD to submit work items. The command portal is blocked
* by the kernel until the CPU translation is enabled.
*/
#define VFIO_DEVICE_INFO_CAP_PASID 5


/*
* Attach a vfio device to the specified IOASID
*
* Multiple vfio devices can be attached to the same IOASID, and vice
* versa.
*
* The user may optionally provide a "virtual PASID" to mark an I/O page
* table on this vfio device, if PASID_DELEGATED is not set in device info.
* Whether the virtual PASID is physically used or converted to another
* kernel-allocated PASID is a policy decision in the kernel.
*
* Because one device is allowed to bind to multiple IOMMU fd's, the
* user should provide both iommu_fd and ioasid for this attach operation.
*
* Input parameter:
* - iommu_fd;
* - ioasid;
* - flag;
* - vpasid (if specified);
*
* Return: 0 on success, -errno on failure.
*/
#define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 23)
#define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24)


2.3. KVM uAPI
+++++++++++++

/*
* Check/enable CPU PASID translation via KVM CAP interface
*
* This is necessary when ENQCMD will be used in the guest while the
* targeted device doesn't accept the vPASID saved in the CPU MSR.
*/
#define KVM_CAP_PASID_TRANSLATION 206


/*
* Update CPU PASID mapping
*
* This command allows the user to set/clear the vPASID->pPASID mapping
* in the CPU, by providing the IOASID (and fd) information representing
* the I/O address space marked by this vPASID. KVM calls an iommu helper
* function to retrieve the pPASID according to the input parameters. So
* the pPASID value is completely hidden from the user.
*
* Input parameters:
* - user_pasid;
* - iommu_fd;
* - ioasid;
*/
#define KVM_MAP_PASID _IO(KVMIO, 0xf0)
#define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)


/*
* A new contract to exchange the no-snoop DMA status with IOMMU fd.
* This will be a device-centric interface, thus the existing vfio-kvm
* contract is not suitable as it's group-centric.
*
* Actual definition TBD.
*/


3. Sample structures and helper functions
--------------------------------------------------------

Three helper functions are provided to support VFIO_BIND_IOMMU_FD:

    struct iommu_ctx *iommu_ctx_fdget(int fd);
    struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
                                            struct device *device, u64 cookie);
    int iommu_unregister_device(struct iommu_dev *dev);

An iommu_ctx is created for each fd:

    struct iommu_ctx {
        // a list of allocated IOASID data's
        struct xarray ioasid_xa;

        // a list of registered devices
        struct xarray dev_xa;
    };

Later some group-tracking fields will also be introduced to support
multi-device groups.

Each registered device is represented by iommu_dev:

    struct iommu_dev {
        struct iommu_ctx *ctx;
        // always the physical device
        struct device *device;
        u64 cookie;
        struct kref kref;
    };

A successful binding establishes a security context for the bound
device and returns a struct iommu_dev pointer to the caller. After this
point, the user is allowed to query device capabilities via
IOMMU_DEVICE_GET_INFO.

For mdev the struct device should be the pointer to the parent device.

An ioasid_data is created upon IOMMU_IOASID_ALLOC, as the main
object describing the characteristics of an I/O page table:

    struct ioasid_data {
        struct iommu_ctx *ctx;

        // the IOASID number
        u32 ioasid;

        // the handle for a kernel-managed I/O page table
        struct iommu_domain *domain;

        // map metadata (vfio type1 semantics)
        struct rb_node dma_list;

        // pointer to the user-managed pgtable
        u64 user_pgd;

        // link to the parent ioasid (for nesting)
        struct ioasid_data *parent;

        // IOMMU enforce-snoop
        bool enforce_snoop;

        // various format information
        ...

        // a list of device attach data (routing information)
        struct list_head attach_data;

        // a list of fault_data reported from the iommu layer
        struct list_head fault_data;

        ...
    };

iommu_domain is the object for operating the kernel-managed I/O
page tables in the IOMMU layer. An ioasid_data is associated with an
iommu_domain explicitly or implicitly:

- a root IOASID (except the 'dummy' one for locked accounting)
must use a kernel-managed I/O page table thus is always linked to an
iommu_domain;

- a child IOASID (via software nesting) is explicitly linked to an
iommu_domain as the shadow I/O page table is managed by the kernel;

- a child IOASID (via hardware nesting) is linked to another simpler
iommu layer object (TBD) for tracking the user-managed page table. Due
to nesting it is also implicitly linked to the iommu_domain of the
parent;

The following link has an initial discussion on this part:

https://lore.kernel.org/linux-iommu/BN9PR11MB54331FC6BB31E8CBF11914A48C019@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#m2c19d3825cc096daf2026ea94e00cc5858cda321

As Jason recommended in v1, bus-specific wrapper functions are provided
explicitly to support VFIO_ATTACH_IOASID, e.g.

    struct iommu_attach_data *iommu_pci_device_attach(
            struct iommu_dev *dev, struct pci_device *pdev,
            u32 ioasid);
    struct iommu_attach_data *iommu_pci_device_attach_pasid(
            struct iommu_dev *dev, struct pci_device *pdev,
            u32 ioasid, u32 pasid);

and variants for non-PCI devices.

A helper function is provided for the above wrappers:

    // flags specifies whether pasid is valid
    struct iommu_attach_data *__iommu_device_attach(
            struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);

A new object is introduced and linked to ioasid_data->attach_data for
each successful attach operation:

    struct iommu_attach_data {
        struct list_head next;
        struct iommu_dev *dev;
        u32 pasid;
    };

The helper function for VFIO_DETACH_IOASID is generic:

    int iommu_device_detach(struct iommu_attach_data *data);

4. Use Cases and Flows
-------------------------------

Here we assume VFIO will support a new model where /dev/iommu capable
devices are explicitly listed under /dev/vfio/devices, thus a device fd
can be acquired w/o going through the legacy container/group interface.
They may be further categorized into sub-directories based on device
types (e.g. pdev, mdev, etc.). For illustration purposes those devices
are put together and just called dev[1...N]:

    device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

VFIO continues to support the container/group model for legacy
applications and also for devices which are not moved to /dev/iommu in
one breath (e.g. in a group with multiple devices, or supporting no-snoop
DMA). In concept there is no problem for VFIO to support two models
simultaneously, but we'll wait to see any issue when reaching
implementation.

As explained earlier, one IOMMU fd is sufficient for all intended use cases:

    iommu_fd = open("/dev/iommu", mode);

For simplicity the examples below are all made for the virtualization
story. They are representative and could be easily adapted to a
non-virtualization scenario.

Three types of IOASIDs are considered:

    gpa_ioasid[1...N]:   GPA as the default address space
    giova_ioasid[1...N]: GIOVA as the default address space (nesting)
    gva_ioasid[1...N]:   CPU VA as a non-default address space (nesting)

At least one gpa_ioasid must always be created per guest, while the other
two are relevant as far as vIOMMU is concerned.

Examples here apply to both pdev and mdev. VFIO device driver in the
kernel will figure out the associated routing information in the attaching
operation.

For illustration simplicity, IOMMU_CHECK_EXTENSION and IOMMU_DEVICE_
GET_INFO are skipped in these examples. No-snoop DMA is also not covered here.

The examples below may not apply to all platforms. For example, the PAPR
IOMMU on the PPC platform always requires a vIOMMU and blocks DMAs until
the device is explicitly attached to a GIOVA address space. There are
even fixed associations between available GIOVA spaces and devices. Those
platform specific variances are not covered here and will be figured out
in the implementation phase.

4.1. A simple example
+++++++++++++++++++++

Dev1 is assigned to the guest. A cookie has been allocated by the user
to represent this device in the iommu_fd.

One gpa_ioasid is created. The GPA address space is managed through
the DMA mapping protocol by specifying that the I/O page table is managed
by the kernel:

    /* Bind the device to IOMMU fd */
    device_fd = open("/dev/vfio/devices/dev1", mode);
    iommu_fd = open("/dev/iommu", mode);
    bind_data = { .fd = iommu_fd; .cookie = cookie };
    ioctl(device_fd, VFIO_BIND_IOMMU_FD, &bind_data);

    /* Allocate IOASID */
    alloc_data = { .user_pgtable = false };
    gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

    /* Attach the device to the IOASID */
    at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid };
    ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

    /* Setup GPA mapping [0 - 1GB] */
    dma_map = {
        .ioasid = gpa_ioasid;
        .iova   = 0;            // GPA
        .vaddr  = 0x40000000;   // HVA
        .size   = 1GB;
    };
    ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

If more devices than dev1 are assigned to the guest, the user follows the
above sequence to attach the other devices to the same gpa_ioasid, i.e.
sharing the GPA address space across all assigned devices, e.g. for dev2:

    bind_data = { .fd = iommu_fd; .cookie = cookie2 };
    ioctl(device_fd2, VFIO_BIND_IOMMU_FD, &bind_data);
    ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

4.2. Multiple IOASIDs (no nesting)
++++++++++++++++++++++++++++++++++

Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
both devices are attached to gpa_ioasid. After boot the guest creates
a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in
passthrough mode (gpa_ioasid).

Suppose IOASID nesting is not supported in this case. Qemu needs to
generate shadow mappings in userspace for giova_ioasid (like how
VFIO works today). The side-effect is that duplicated locked page
accounting might be incurred in this example as there are two root
IOASIDs now. It will be fixed once IOASID nesting is supported:

    device_fd1 = open("/dev/vfio/devices/dev1", mode);
    device_fd2 = open("/dev/vfio/devices/dev2", mode);
    iommu_fd = open("/dev/iommu", mode);

    /* Bind both devices to IOMMU fd */
    bind_data = { .fd = iommu_fd; .cookie = cookie1 };
    ioctl(device_fd1, VFIO_BIND_IOMMU_FD, &bind_data);
    bind_data = { .fd = iommu_fd; .cookie = cookie2 };
    ioctl(device_fd2, VFIO_BIND_IOMMU_FD, &bind_data);

    /* Allocate IOASID */
    alloc_data = { .user_pgtable = false };
    gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

    /* Attach dev1 and dev2 to gpa_ioasid */
    at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid };
    ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
    ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

    /* Setup GPA mapping [0 - 1GB] */
    dma_map = {
        .ioasid = gpa_ioasid;
        .iova   = 0;            // GPA
        .vaddr  = 0x40000000;   // HVA
        .size   = 1GB;
    };
    ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

    /* After boot, the guest enables a GIOVA space for dev2 via vIOMMU */
    alloc_data = { .user_pgtable = false };
    giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

    /* First detach dev2 from the previous address space */
    at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid };
    ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);

    /* Then attach dev2 to the new address space */
    at_data = { .fd = iommu_fd; .ioasid = giova_ioasid };
    ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

    /* Setup a shadow DMA mapping according to vIOMMU.
     *
     * e.g. the vIOMMU page table adds a new 4KB mapping:
     *     GIOVA [0x2000] -> GPA [0x1000]
     *
     * and GPA [0x1000] is mapped to HVA [0x40001000] in gpa_ioasid.
     *
     * In this case the shadow mapping should be:
     *     GIOVA [0x2000] -> HVA [0x40001000]
     */
    dma_map = {
        .ioasid = giova_ioasid;
        .iova   = 0x2000;       // GIOVA
        .vaddr  = 0x40001000;   // HVA
        .size   = 4KB;
    };
    ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

4.3. IOASID nesting (software)
++++++++++++++++++++++++++++++

Same usage scenario as 4.2, with software-based IOASID nesting
available. In this mode it is the kernel instead of the user that creates
the shadow mapping.

The flow before the guest boots is the same as 4.2, except for one point.
Because giova_ioasid is nested on gpa_ioasid, locked accounting is only
conducted for gpa_ioasid which becomes the only root.

There could be a case where different gpa_ioasids are created due to
incompatible formats between dev1/dev2 (e.g. about IOMMU
enforce-snoop). In such a case the user could further create a dummy
IOASID (HVA->HVA) as the root parent for the two gpa_ioasids to avoid
duplicated accounting. But this scenario is not covered in the following
flows.

To save space we only list the steps after boot (i.e. both dev1/dev2
have been attached to gpa_ioasid before the guest boots):

    /* After boot */
    /* Create a GIOVA space nested on the GPA space.
     * Both page tables are managed by the kernel.
     */
    alloc_data = { .user_pgtable = false; .parent = gpa_ioasid };
    giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

    /* Attach dev2 to the new address space (child).
     * Note dev2 is still attached to gpa_ioasid (parent).
     */
    at_data = { .fd = iommu_fd; .ioasid = giova_ioasid };
    ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

    /* Setup a GIOVA [0x2000] -> GPA [0x1000] mapping for giova_ioasid,
     * based on the vIOMMU page table. The kernel is responsible for
     * creating the shadow mapping GIOVA [0x2000] -> HVA [0x40001000]
     * by walking the parent's I/O page table to find out GPA [0x1000] ->
     * HVA [0x40001000].
     */
    dma_map = {
        .ioasid = giova_ioasid;
        .iova   = 0x2000;       // GIOVA
        .vaddr  = 0x1000;       // GPA
        .size   = 4KB;
    };
    ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

4.4. IOASID nesting (hardware)
++++++++++++++++++++++++++++++

Same usage scenario as 4.2, with hardware-based IOASID nesting
available. In this mode the I/O page table is managed by userspace,
thus an invalidation interface is provided for the user to request
iotlb invalidation.

    /* After boot */
    /* Create a GIOVA space nested on the GPA space.
     * Claim it's a user-managed I/O page table.
     */
    alloc_data = {
        .user_pgtable = true;
        .parent = gpa_ioasid;
        .addr = giova_pgtable;
        // and format information;
    };
    giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

    /* Attach dev2 to the new address space (child).
     * Note dev2 is still attached to gpa_ioasid (parent).
     */
    at_data = { .fd = iommu_fd; .ioasid = giova_ioasid };
    ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

    /* Invalidate the IOTLB when required */
    inv_data = {
        .ioasid = giova_ioasid;
        // granularity/cache type information
    };
    ioctl(iommu_fd, IOMMU_INVALIDATE_CACHE, &inv_data);

    /* See 4.6 for I/O page fault handling */

4.5. Guest SVA (vSVA)
+++++++++++++++++++++

After boot the guest further creates a GVA address space (vpasid1) on
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 1.4, the user should check the PASID capability
exposed via VFIO_DEVICE_GET_INFO and follow the required uAPI
semantics when doing the attaching call:

    /****** If dev1 reports PASID_DELEGATED=false **********/
    /* After boot */
    /* Create a GVA space nested on the GPA space.
     * Claim it's a user-managed I/O page table.
     */
    alloc_data = {
        .user_pgtable = true;
        .parent = gpa_ioasid;
        .addr = gva_pgtable;
        // and format information;
    };
    gva_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

    /* Attach dev1 to the new address space (child) and specify
     * vPASID. Note dev1 is still attached to gpa_ioasid (parent).
     */
    at_data = {
        .fd = iommu_fd;
        .ioasid = gva_ioasid;
        .flag = IOASID_ATTACH_VPASID;
        .vpasid = vpasid1;
    };
    ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

    /* Enable CPU PASID translation if required */
    if (PASID_CPU and PASID_CPU_VIRT are both true for dev1) {
        pa_data = {
            .iommu_fd = iommu_fd;
            .ioasid = gva_ioasid;
            .vpasid = vpasid1;
        };
        ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
    };

    /* Invalidate the IOTLB when required */
    ...

    /****** If dev1 reports PASID_DELEGATED=true **********/
    /* Create the user-managed vPASID space when it's enabled via vIOMMU */
    alloc_data = {
        .user_pasid_table = true;
        .parent = gpa_ioasid;
        .addr = gpasid_tbl;
        // and format information;
    };
    pasidtbl_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

    /* Attach dev1 to the vPASID space */
    at_data = { .fd = iommu_fd; .ioasid = pasidtbl_ioasid };
    ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

    /* From now on all GVA address spaces on dev1 are represented by
     * a single pasidtbl_ioasid as the placeholder in the kernel.
     *
     * But iotlb invalidation and fault handling are still per GVA
     * address space. They still go through IOMMU fd in the same way
     * as in the PASID_DELEGATED=false scenario.
     */
    ...

4.6. I/O page fault
+++++++++++++++++++

The uAPI is TBD. Here is just the high-level flow from the host IOMMU
driver to the guest IOMMU driver and back. This flow assumes that I/O
page faults are reported via IOMMU interrupts. Some devices report
faults in a device-specific way instead of going through the IOMMU. That
usage is not covered here:

- The host IOMMU driver receives an I/O page fault with raw fault_data
{rid, pasid, addr};

- The host IOMMU driver identifies the faulting I/O page table according
to {rid, pasid} and calls the corresponding fault handler with an opaque
object (registered by the handler) and the raw fault_data (rid, pasid,
addr);

- The IOASID fault handler identifies the corresponding ioasid and device
cookie according to the opaque object, generates a user fault_data
(ioasid, cookie, addr) in the fault region, and triggers the eventfd to
userspace;

* In case ioasid represents a pasid table, pasid is also included as
additional fault_data;

* the raw fault_data is also cached in ioasid_data->fault_data and
used when generating response;

- Upon receiving the event, Qemu needs to find the virtual routing
information (v_rid + v_pasid) of the device attached to the faulting
ioasid;

* v_rid is identified according to device_cookie;

* v_pasid is either identified according to ioasid, or already carried
in the fault data;

- Qemu generates a virtual I/O page fault through vIOMMU into the guest,
carrying the virtual fault data (v_rid, v_pasid, addr);

- Guest IOMMU driver fixes up the fault, updates the guest I/O page table
(GIOVA or GVA), and then sends a page response with virtual completion
data (v_rid, v_pasid, response_code) to vIOMMU;

- Qemu finds the pending fault event, converts virtual completion data
into (ioasid, cookie, response_code), and then calls a /dev/iommu ioctl to
complete the pending fault;

- /dev/iommu finds the pending fault data {rid, pasid, addr} saved in
ioasid_data->fault_data, and then calls the iommu API to complete it
with {rid, pasid, response_code};
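
For reference, the user-visible fault record could be shaped like below
(hypothetical; the actual uAPI is TBD as stated above):

    struct iommu_fault_data {
        __u32   ioasid;
        __u64   dev_cookie;
        __u64   addr;       // faulting address
        __u32   perm;       // requested permission
        __u32   pasid;      // valid only if the ioasid is a PASID table
        // vendor specific fault info may follow ...
    };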