Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

From: Auger Eric
Date: Thu Aug 27 2020 - 12:21:45 EST


Hi Jacob,
On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote:
> On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:
>> IOASID is used to identify address spaces that can be targeted by device
>> DMA. It is a system-wide resource that is essential to its many users.
>> This document is an attempt to help developers from all vendors navigate
>> the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
>> (SIOV) enabled platforms are the primary users of IOASID. Examples of
>> how SIOV components interact with IOASID APIs are provided in that many
>> APIs are driven by the requirements from SIOV.
>>
>> Signed-off-by: Liu Yi L <yi.l.liu@xxxxxxxxx>
>> Signed-off-by: Wu Hao <hao.wu@xxxxxxxxx>
>> Signed-off-by: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
>> ---
>> Documentation/ioasid.rst | 618 +++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 618 insertions(+)
>> create mode 100644 Documentation/ioasid.rst
>>
>> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
>
> Thanks for writing this up. Should it go to Documentation/driver-api/, or
> Documentation/driver-api/iommu/? I think this also needs to Cc
> linux-doc@xxxxxxxxxxxxxxx and corbet@xxxxxxx
>
>> new file mode 100644
>> index 000000000000..b6a8cdc885ff
>> --- /dev/null
>> +++ b/Documentation/ioasid.rst
>> @@ -0,0 +1,618 @@
>> +.. ioasid:
>> +
>> +=====================================
>> +IO Address Space ID
>> +=====================================
>> +
>> +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
>> +SMMU sub-stream ID. An IOASID identifies an address space that DMA
>
> "SubstreamID"
On ARM if we don't use PASIDs we have streamids (SID) which can also
identify address spaces that DMA requests can target. So maybe this
definition is not sufficient.

>
>> +requests can target.
>> +
>> +The primary use cases for IOASID are Shared Virtual Address (SVA) and
>> +IO Virtual Address (IOVA). However, the requirements for IOASID
>
> IOVA alone isn't a use case, maybe "multiple IOVA spaces per device"?
>
>> +management can vary among hardware architectures.
>> +
>> +This document covers the generic features supported by IOASID
>> +APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
>> +based platforms as the first example.
>> +
>> +.. contents:: :local:
>> +
>> +Glossary
>> +========
>> +PASID - Process Address Space ID
>> +
>> +IOASID - IO Address Space ID (generic term for PCIe PASID and
>> +sub-stream ID in SMMU)
>
> "SubstreamID"
>
>> +
>> +SVA/SVM - Shared Virtual Addressing/Memory
>> +
>> +ENQCMD - New Intel X86 ISA for efficient workqueue submission [1]
>
> Maybe drop the "New", to keep the documentation perennial. It might be
> good to add internal links here to the specifications URLs at the bottom.
>
>> +
>> +DSA - Intel Data Streaming Accelerator [2]
>> +
>> +VDCM - Virtual device composition module [3]
>> +
>> +SIOV - Intel Scalable IO Virtualization
>> +
>> +
>> +Key Concepts
>> +============
>> +
>> +IOASID Set
>> +-----------
>> +An IOASID set is a group of IOASIDs allocated from the system-wide
>> +IOASID pool. An IOASID set is created and can be identified by a
>> +token of u64. Refer to IOASID set APIs for more details.
>
> Identified either by an u64 or an mm_struct, right? Maybe just drop the
> second sentence if it's detailed in the IOASID set section below.
>
>> +
>> +IOASID set is particularly useful for guest SVA where each guest could
>> +have its own IOASID set for security and efficiency reasons.
>> +
>> +IOASID Set Private ID (SPID)
>> +----------------------------
>> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to a
>> +system-wide IOASID but the namespace of SPID is within its IOASID
>> +set.
>
> The intro isn't super clear. Perhaps this is simpler:
> "Each IOASID set has a private namespace of SPIDs. An SPID maps to a
> single system-wide IOASID."
or, "within an ioasid set, each ioasid can be associated with an alias
ID, named SPID."
>
>> SPIDs can be used as guest IOASIDs where each guest could do
>> +IOASID allocation from its own pool and map them to host physical
>> +IOASIDs. SPIDs are particularly useful for supporting live migration
>> +where decoupling guest and host physical resources are necessary.
>> +
>> +For example, two VMs can both allocate guest PASID/SPID #101 but map to
>> +different host PASIDs #201 and #202 respectively as shown in the
>> +diagram below.
>> +::
>> +
>> + .------------------. .------------------.
>> + | VM 1 | | VM 2 |
>> + | | | |
>> + |------------------| |------------------|
>> + | GPASID/SPID 101 | | GPASID/SPID 101 |
>> + '------------------' -------------------' Guest
>> + __________|______________________|______________________
>> + | | Host
>> + v v
>> + .------------------. .------------------.
>> + | Host IOASID 201 | | Host IOASID 202 |
>> + '------------------' '------------------'
>> + | IOASID set 1 | | IOASID set 2 |
>> + '------------------' '------------------'
>> +
>> +Guest PASID is treated as IOASID set private ID (SPID) within an
>> +IOASID set, mappings between guest and host IOASIDs are stored in the
>> +set for inquiry.
>> +
>> +IOASID APIs
>> +===========
>> +To get the IOASID APIs, users must #include <linux/ioasid.h>. These APIs
>> +serve the following functionalities:
>> +
>> + - IOASID allocation/Free
>> + - Group management in the form of ioasid_set
>> + - Private data storage and lookup
>> + - Reference counting
>> + - Event notification in case of state change
(a)
>> +
>> +IOASID Set Level APIs
>> +--------------------------
>> +For use cases such as guest SVA it is necessary to manage IOASIDs at
>> +a group level. For example, VMs may allocate multiple IOASIDs for
I would use the introduced ioasid_set terminology instead of "group".
>> +guest process address sharing (vSVA). It is imperative to enforce
>> +VM-IOASID ownership such that malicious guest cannot target DMA
>
> "a malicious guest"
>
>> +traffic outside its own IOASIDs, or free an active IOASID belong to
>
> "that belongs to"
>
>> +another VM.
>> +::
>> +
>> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32 type)
what is this void *token? also the type may be explained here.
>> +
>> + int ioasid_adjust_set(struct ioasid_set *set, int quota);
>
> These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to be
> consistent with the rest of the API.
>
>> +
>> + void ioasid_set_get(struct ioasid_set *set)
>> +
>> + void ioasid_set_put(struct ioasid_set *set)
>> +
>> + void ioasid_set_get_locked(struct ioasid_set *set)
>> +
>> + void ioasid_set_put_locked(struct ioasid_set *set)
>> +
>> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
>
> Might be nicer to keep the same argument names within the API. Here "set"
> rather than "sdata".
>
>> + void (*fn)(ioasid_t id, void *data),
>> + void *data)
>
> (alignment)
>
>> +
>> +
>> +IOASID set concept is introduced to represent such IOASID groups. Each
>
> Or just "IOASID sets represent such IOASID groups", but might be
> redundant.
>
>> +IOASID set is created with a token which can be one of the following
>> +types:
I think this explanation should happen before the above function prototypes
>> +
>> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
>> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
>> +
>> +The explicit MM token type is useful when multiple users of an IOASID
>> +set under the same process need to communicate about their shared IOASIDs.
>> +E.g. An IOASID set created by VFIO for one guest can be associated
>> +with the KVM instance for the same guest since they share a common mm_struct.
>> +
>> +The IOASID set APIs serve the following purposes:
>> +
>> + - Ownership/permission enforcement
>> + - Take collective actions, e.g. free an entire set
>> + - Event notifications within a set
>> + - Look up a set based on token
>> + - Quota enforcement
>
> This paragraph could be earlier in the section

yes this is a kind of repetition of (a), above
>
>> +
>> +Individual IOASID APIs
>> +----------------------
>> +Once an ioasid_set is created, IOASIDs can be allocated from the set.
>> +Within the IOASID set namespace, set private ID (SPID) is supported. In
>> +the VM use case, SPID can be used for storing guest PASID.
>> +
>> +::
>> +
>> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>> + void *private);
>> +
>> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
>> +
>> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
>> +
>> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
>> +
>> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
>> +
>> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>> + bool (*getter)(void *));
>> +
>> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
>> +
>> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
>> + void *data);
>> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
>> + ioasid_t ssid);
>
> s/ssid/spid>
>> +
>> +
>> +Notifications
>> +-------------
>> +An IOASID may have multiple users, each user may have hardware context
>> +associated with an IOASID. When the status of an IOASID changes,
>> +e.g. an IOASID is being freed, users need to be notified such that the
>> +associated hardware context can be cleared, flushed, and drained.
>> +
>> +::
>> +
>> + int ioasid_register_notifier(struct ioasid_set *set, struct
>> + notifier_block *nb)
>> +
>> + void ioasid_unregister_notifier(struct ioasid_set *set,
>> + struct notifier_block *nb)
>> +
>> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
>> + notifier_block *nb)
>> +
>> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
>> + notifier_block *nb)
the mm_struct prototypes may be justified
>> +
>> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
>> + unsigned int flags)
this one is not obvious either.
>> +
>> +
>> +Events
>> +~~~~~~
>> +Notification events are pertinent to individual IOASIDs, they can be
>> +one of the following:
>> +
>> + - ALLOC
>> + - FREE
>> + - BIND
>> + - UNBIND
>> +
>> +Ordering
>> +~~~~~~~~
>> +Ordering is supported by IOASID notification priorities as the
>> +following (in ascending order):
>> +
>> +::
>> +
>> + enum ioasid_notifier_prios {
>> + IOASID_PRIO_LAST,
>> + IOASID_PRIO_IOMMU,
>> + IOASID_PRIO_DEVICE,
>> + IOASID_PRIO_CPU,
>> + };

Maybe:
when registered, notifiers are assigned a priority that affect the call
order. Notifiers with CPU priority get called before notifiers with
device priority and so on.
>> +
>> +The typical use case is when an IOASID is freed due to an exception, DMA
>> +source should be quiesced before tearing down other hardware contexts
>> +in the system. This will reduce the churn in handling faults. DMA work
>> +submission is performed by the CPU which is granted higher priority than
>> +devices.
>> +
>> +
>> +Scopes
>> +~~~~~~
>> +There are two types of notifiers in IOASID core: system-wide and
>> +ioasid_set-wide.
>> +
>> +System-wide notifier is catering for users that need to handle all
>> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
>> +
>> +Per ioasid_set notifier can be used by VM specific components such as
>> +KVM. After all, each KVM instance only cares about IOASIDs within its
>> +own set.
>> +
>> +
>> +Atomicity
>> +~~~~~~~~~
>> +IOASID notifiers are atomic due to spinlocks used inside the IOASID
>> +core. For tasks cannot be completed in the notifier handler, async work
>
> "tasks that cannot be"
>
>> +can be submitted to complete the work later as long as there is no
>> +ordering requirement.
>> +
>> +Reference counting
>> +------------------
>> +IOASID lifecycle management is based on reference counting. Users of
>> +IOASID intend to align lifecycle with the IOASID need to hold
>
> "who intend to"
>
>> +reference of the IOASID. IOASID will not be returned to the pool for
>
> "a reference to the IOASID. The IOASID"
>
>> +allocation until all references are dropped. Calling ioasid_free()
>> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
>> +reference. ioasid_get() is not allowed once an IOASID is in the
>> +FREE_PENDING state.
>> +
>> +Event notifications are used to inform users of IOASID status change.
>> +IOASID_FREE event prompts users to drop their references after
>> +clearing its context.
>> +
>> +For example, on VT-d platform when an IOASID is freed, teardown
>> +actions are performed on KVM, device driver, and IOMMU driver.
>> +KVM shall register notifier block with::
>> +
>> + static struct notifier_block pasid_nb_kvm = {
>> + .notifier_call = pasid_status_change_kvm,
>> + .priority = IOASID_PRIO_CPU,
>> + };
>> +
>> +VDCM driver shall register notifier block with::
>> +
>> + static struct notifier_block pasid_nb_vdcm = {
>> + .notifier_call = pasid_status_change_vdcm,
>> + .priority = IOASID_PRIO_DEVICE,
>> + };
not sure those code snippets are really useful. Maybe simply say who is
supposed to use each prio.
>> +
>> +In both cases, notifier blocks shall be registered on the IOASID set
>> +such that *only* events from the matching VM is received.
>> +
>> +If KVM attempts to register notifier block before the IOASID set is
>> +created for the MM token, the notifier block will be placed on a
using the MM token
>> +pending list inside IOASID core. Once the token matching IOASID set
>> +is created, IOASID will register the notifier block automatically.
Is this implementation mandated? Can't you enforce the ioasid_set to be
created before the notifier gets registered?
>> +IOASID core does not replay events for the existing IOASIDs in the
>> +set. For IOASID set of MM type, notification blocks can be registered
>> +on empty sets only. This is to avoid lost events.
>> +
>> +IOMMU driver shall register notifier block on global chain::
>> +
>> + static struct notifier_block pasid_nb_vtd = {
>> + .notifier_call = pasid_status_change_vtd,
>> + .priority = IOASID_PRIO_IOMMU,
>> + };
>> +
>> +Custom allocator APIs
>> +---------------------
>> +
>> +::
>> +
>> + int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
>> +
>> + void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
>> +
>> +Allocator Choices
>> +~~~~~~~~~~~~~~~~~
>> +IOASIDs are allocated for both host and guest SVA/IOVA usage. However,
>> +allocators can be different. For example, on VT-d guest PASID
>> +allocation must be performed via a virtual command interface which is
>> +emulated by VMM.
>> +
>> +IOASID core has the notion of "custom allocator" such that guest can
>> +register virtual command allocator that precedes the default one.
>> +
>> +Namespaces
>> +~~~~~~~~~~
>> +IOASIDs are limited system resources that default to 20 bits in
>> +size. Since each device has its own table, theoretically the namespace
>> +can be per device also. However, for security reasons sharing PASID
>> +tables among devices are not good for isolation. Therefore, IOASID
>> +namespace is system-wide.
>
> I don't follow this development. Having per-device PASID table would work
> fine for isolation (assuming no hardware bug necessitating IOMMU groups).
> If I remember correctly IOASID space was chosen to be OS-wide because it
> simplifies the management code (single PASID per task), and it is
> system-wide across VMs only in the case of VT-d scalable mode.
>
>> +
>> +There are also other reasons to have this simpler system-wide
>> +namespace. Take VT-d as an example, VT-d supports shared workqueue
>> +and ENQCMD[1] where one IOASID could be used to submit work on
>
> Maybe use the Sphinx glossary syntax rather than "[1]"
> https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive
>
>> +multiple devices that are shared with other VMs. This requires IOASID
>> +to be system-wide. This is also the reason why guests must use an
>> +emulated virtual command interface to allocate IOASID from the host.
>> +
>> +
>> +Life cycle
>> +==========
>> +This section covers IOASID lifecycle management for both bare-metal
>> +and guest usages. In bare-metal SVA, MMU notifier is directly hooked
>> +up with IOMMU driver, therefore the process address space (MM)
>> +lifecycle is aligned with IOASID.
therefore the IOASID lifecyle matches the process address space (MM)
lifecyle?
>> +
>> +However, guest MMU notifier is not available to host IOMMU driver,
the guest MMU notifier
>> +when guest MM terminates unexpectedly, the events have to go through
the guest MM
>> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more
>> +parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs
>> +are used by IOMMU driver, KVM, VDCM, and VFIO.
>> +
>> +Native IOASID Life Cycle (VT-d Example)
>> +---------------------------------------
>> +
>> +The normal flow of native SVA code with Intel Data Streaming
>> +Accelerator(DSA) [2] as example:
>> +
>> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
>> +2. DSA driver allocate WQ, do sva_bind_device();
>> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
>> + mmu_notifier_get()
>> +4. DMA starts by DSA driver userspace
>> +5. DSA userspace close FD
>> +6. DSA/uacce kernel driver handles FD.close()
>> +7. DSA driver stops DMA
>> +8. DSA driver calls sva_unbind_device();
>> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
>> + TLBs. mmu_notifier_put() called.
>> +10. mmu_notifier.release() called, IOMMU SVA code calls ioasid_free()*
>> +11. The IOASID is returned to the pool, reclaimed.
>> +
>> +::
>> +
>
> Use a footnote? https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes
>
>> + * With ENQCMD, PASID used on VT-d is not released in mmu_notifier() but
>> + mmdrop(). mmdrop comes after FD close. Should not matter.
>
> "comes after FD close, which doesn't make a difference?"
> The following might not be necessary since early process termination is
> described later.
>
>> + If the user process dies unexpectedly, Step #10 may come before
>> + Step #5, in between, all DMA faults discarded. PRQ responded with
>
> PRQ hasn't been defined in this document.
>
>> + code INVALID REQUEST.
>> +
>> +During the normal teardown, the following three steps would happen in
>> +order:
can't this be illustrated in the above 1-11 sequence, just adding NORMAL
TEARDONW before #7?
>> +
>> +1. Device driver stops DMA request
>> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain in-flight
>> + requests.
>> +3. IOASID freed
>> +
Then you can just focus on abnormal termination
>> +Exception happens when process terminates *before* device driver stops
>> +DMA and call IOMMU driver to unbind. The flow of process exists are as
Can't this be explained with something simpler looking at the steps 1-11?
>
> "exits"
>
>> +follows:
>> +
>> +::
>> +
>> + do_exit() {
>> + exit_mm() {
>> + mm_put();
>> + exit_mmap() {
>> + intel_invalidate_range() //mmu notifier
>> + tlb_finish_mmu()
>> + mmu_notifier_release(mm) {
>> + intel_iommu_release() {
>> + [2] intel_iommu_teardown_pasid();
>
> Parentheses might be better than square brackets for step numbers
>
>> + intel_iommu_flush_tlbs();
>> + }
>> + // tlb_invalidate_range cb removed
>> + }
>> + unmap_vmas();
>> + free_pgtables(); // IOMMU cannot walk PGT after this
>> + };
>> + }
>> + exit_files(tsk) {
>> + close_files() {
>> + dsa_close();
>> + [1] dsa_stop_dma();
>> + intel_svm_unbind_pasid(); //nothing to do
>> + }
>> + }
>> + }
>> +
>> + mmdrop() /* some random time later, lazy mm user */ {
>> + mm_free_pgd();
>> + destroy_context(mm); {
>> + [3] ioasid_free();
>> + }
>> + }
>> +
>> +As shown in the list above, step #2 could happen before
>> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
>> +
>> +Also notice that TLB invalidation occurs at mmu_notifier
>> +invalidate_range callback as well as the release callback. The reason
>> +is that release callback will delete IOMMU driver from the notifier
>> +chain which may skip invalidate_range() calls during the exit path.
>> +
>> +To avoid unnecessary reporting of UR fault, IOMMU driver shall disable
UR?
>> +fault reporting after free and before unbind.
>> +
>> +Guest IOASID Life Cycle (VT-d Example)
>> +--------------------------------------
>> +Guest IOASID life cycle starts with guest driver open(), this could be
>> +uacce or individual accelerator driver such as DSA. At FD open,
>> +sva_bind_device() is called which triggers a series of actions.
>> +
>> +The example below is an illustration of *normal* operations that
>> +involves *all* the SW components in VT-d. The flow can be simpler if
>> +no ENQCMD is supported.
>> +
>> +::
>> +
>> + VFIO IOMMU KVM VDCM IOASID Ref
>> + ..................................................................
>> + 1 ioasid_register_notifier/_mm()
>> + 2 ioasid_alloc() 1
>> + 3 bind_gpasid()
>> + 4 iommu_bind()->ioasid_get() 2
>> + 5 ioasid_notify(BIND)
>> + 6 -> ioasid_get() 3
>> + 7 -> vmcs_update_atomic()
>> + 8 mdev_write(gpasid)
>> + 9 hpasid=
>> + 10 find_by_spid(gpasid) 4
>> + 11 vdev_write(hpasid)
>> + 12 -------- GUEST STARTS DMA --------------------------
>> + 13 -------- GUEST STOPS DMA --------------------------
>> + 14 mdev_clear(gpasid)
>> + 15 vdev_clear(hpasid)
>> + 16 ioasid_put() 3
>> + 17 unbind_gpasid()
>> + 18 iommu_ubind()
>> + 19 ioasid_notify(UNBIND)
>> + 20 -> vmcs_update_atomic()
>> + 21 -> ioasid_put() 2
>> + 22 ioasid_free() 1
>> + 23 ioasid_put() 0
>> + 24 Reclaimed
>> + -------------- New Life Cycle Begin ----------------------------
>> + 1 ioasid_alloc() -> 1
>> +
>> + Note: IOASID Notification Events: FREE, BIND, UNBIND
>> +
>> +Exception cases arise when a guest crashes or a malicious guest
>> +attempts to cause disruption on the host system. The fault handling
>> +rules are:
>> +
>> +1. IOASID free must *always* succeed.
>> +2. An inactive period may be required before the freed IOASID is
>> + reclaimed. During this period, consumers of IOASID perform cleanup.
>> +3. Malfunction is limited to the guest owned resources for all
>> + programming errors.
>> +
>> +The primary source of exception is when the following are out of
>> +order:
>> +
>> +1. Start/Stop of DMA activity
>> + (Guest device driver, mdev via VFIO)
please explain the meaning of what is inside (): initiator?
>> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
>> + (Host IOMMU driver bind/unbind)
>> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
>> + case of ENQCMD
>> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
>> +5. IOASID alloc/free (Host IOASID)
>> +
>> +VFIO is the *only* user-kernel interface, which is ultimately
>> +responsible for exception handlings.
>
> "handling"
>
>> +
>> +#1 is processed the same way as the assigned device today based on
>> +device file descriptors and events. There is no special handling.
>> +
>> +#3 is based on bind/unbind events emitted by #2.
>> +
>> +#4 is naturally aligned with IOASID life cycle in that an illegal
>> +guest PASID programming would fail in obtaining reference of the
>> +matching host IOASID.
>> +
>> +#5 is similar to #4. The fault will be reported to the user if PASID
>> +used in the ENQCMD is not set up in VMCS PASID translation table.
>> +
>> +Therefore, the remaining out of order problem is between #2 and
>> +#5. I.e. unbind vs. free. More specifically, free before unbind.
>> +
>> +IOASID notifier and refcounting are used to ensure order. Following
>> +a publisher-subscriber pattern where:
with the following actors:
>> +
>> +- Publishers: VFIO & IOMMU
>> +- Subscribers: KVM, VDCM, IOMMU
this may be introduced before.
>> +
>> +IOASID notifier is atomic which requires subscribers to do quick
>> +handling of the event in the atomic context. Workqueue can be used for
>> +any processing that requires thread context.
repetition of what was said before.
IOASID reference must be
>> +acquired before receiving the FREE event. The reference must be
>> +dropped at the end of the processing in order to return the IOASID to
>> +the pool.
>> +
>> +Let's examine the IOASID life cycle again when free happens *before*
>> +unbind. This could be a result of misbehaving guests or crash. Assuming
>> +VFIO cannot enforce unbind->free order. Notice that the setup part up
>> +until step #12 is identical to the normal case, the flow below starts
>> +with step 13.
>> +
>> +::
>> +
>> + VFIO IOMMU KVM VDCM IOASID Ref
>> + ..................................................................
>> + 13 -------- GUEST STARTS DMA --------------------------
>> + 14 -------- *GUEST MISBEHAVES!!!* ----------------
>> + 15 ioasid_free()
>> + 16 ioasid_notify(FREE)
>> + 17 mark_ioasid_inactive[1]
>> + 18 kvm_nb_handler(FREE)
>> + 19 vmcs_update_atomic()
>> + 20 ioasid_put_locked() -> 3
>> + 21 vdcm_nb_handler(FREE)
>> + 22 iomm_nb_handler(FREE)
>> + 23 ioasid_free() returns[2] schedule_work() 2
>> + 24 schedule_work() vdev_clear_wk(hpasid)
>> + 25 teardown_pasid_wk()
>> + 26 ioasid_put() -> 1
>> + 27 ioasid_put() 0
>> + 28 Reclaimed
>> + 29 unbind_gpasid()
>> + 30 iommu_unbind()->ioasid_find() Fails[3]
>> + -------------- New Life Cycle Begin ----------------------------
>> +
>> +Note:
>> +
>> +1. By marking IOASID inactive at step #17, no new references can be
>
> Is "inactive" FREE_PENDING?
>
>> + held. ioasid_get/find() will return -ENOENT;
>> +2. After step #23, all events can go out of order. Shall not affect
>> + the outcome.
>> +3. IOMMU driver fails to find private data for unbinding. If unbind is
>> + called after the same IOASID is allocated for the same guest again,
>> + this is a programming error. The damage is limited to the guest
>> + itself since unbind performs permission checking based on the
>> + IOASID set associated with the guest process.
>> +
>> +KVM PASID Translation Table Updates
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +Per VM PASID translation table is maintained by KVM in order to
>> +support ENQCMD in the guest. The table contains host-guest PASID
>> +translations to be consumed by CPU ucode. The synchronization of the
>> +PASID states depends on VFIO/IOMMU driver, where IOCTL and atomic
>> +notifiers are used. KVM must register IOASID notifier per VM instance
>> +during launch time. The following events are handled:
>> +
>> +1. BIND/UNBIND
>> +2. FREE
>> +
>> +Rules:
>> +
>> +1. Multiple devices can bind with the same PASID, this can be different PCI
>> + devices or mdevs within the same PCI device. However, only the
>> + *first* BIND and *last* UNBIND emit notifications.
>> +2. IOASID code is responsible for ensuring the correctness of H-G
>> + PASID mapping. There is no need for KVM to validate the
>> + notification data.
>> +3. When UNBIND happens *after* FREE, KVM will see error in
>> + ioasid_get() even when the reclaim is not done. IOMMU driver will
>> + also avoid sending UNBIND if the PASID is already FREE.
>> +4. When KVM terminates *before* FREE & UNBIND, references will be
>> + dropped for all host PASIDs.
>> +
>> +VDCM PASID Programming
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +VDCM composes virtual devices and exposes them to the guests. When
>> +the guest allocates a PASID then program it to the virtual device, VDCM
programs as well
>> +intercepts the programming attempt then program the matching host
>
> "programs"
>
> Thanks,
> Jean
>
>> +PASID on to the hardware.
>> +Conversely, when a device is going away, VDCM must be informed such
>> +that PASID context on the hardware can be cleared. There could be
>> +multiple mdevs assigned to different guests in the same VDCM. Since
>> +the PASID table is shared at PCI device level, lazy clearing is not
>> +secure. A malicious guest can attack by using newly freed PASIDs that
>> +are allocated by another guest.
>> +
>> +By holding a reference of the PASID until VDCM cleans up the HW context,
>> +it is guaranteed that PASID life cycles do not cross within the same
>> +device.
>> +
>> +
>> +Reference
>> +====================================================
>> +1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
>> +
>> +2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>> +
>> +3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
>> --
>> 2.7.4

Thanks

Eric
>>
>