Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

From: Auger Eric
Date: Mon Sep 07 2020 - 04:04:28 EST


Hi Jacob,

On 9/1/20 6:56 PM, Jacob Pan wrote:
> Hi Eric,
>
> On Thu, 27 Aug 2020 18:21:07 +0200
> Auger Eric <eric.auger@xxxxxxxxxx> wrote:
>
>> Hi Jacob,
>> On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote:
>>> On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:
>>>> IOASID is used to identify address spaces that can be targeted by
>>>> device DMA. It is a system-wide resource that is essential to its
>>>> many users. This document is an attempt to help developers from
>>>> all vendors navigate the APIs. At this time, ARM SMMU and Intel’s
>>>> Scalable IO Virtualization (SIOV) enabled platforms are the
>>>> primary users of IOASID. Examples of how SIOV components interact
>>>> with IOASID APIs are provided in that many APIs are driven by the
>>>> requirements from SIOV.
>>>>
>>>> Signed-off-by: Liu Yi L <yi.l.liu@xxxxxxxxx>
>>>> Signed-off-by: Wu Hao <hao.wu@xxxxxxxxx>
>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
>>>> ---
>>>> Documentation/ioasid.rst | 618
>>>> +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed,
>>>> 618 insertions(+) create mode 100644 Documentation/ioasid.rst
>>>>
>>>> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
>>>
>>> Thanks for writing this up. Should it go to
>>> Documentation/driver-api/, or Documentation/driver-api/iommu/? I
>>> think this also needs to Cc linux-doc@xxxxxxxxxxxxxxx and
>>> corbet@xxxxxxx
>>>> new file mode 100644
>>>> index 000000000000..b6a8cdc885ff
>>>> --- /dev/null
>>>> +++ b/Documentation/ioasid.rst
>>>> @@ -0,0 +1,618 @@
>>>> +.. ioasid:
>>>> +
>>>> +=====================================
>>>> +IO Address Space ID
>>>> +=====================================
>>>> +
>>>> +IOASID is a generic name for PCIe Process Address ID (PASID) or
>>>> ARM +SMMU sub-stream ID. An IOASID identifies an address space
>>>> that DMA
>>>
>>> "SubstreamID"
>> On ARM if we don't use PASIDs we have streamids (SID) which can also
>> identify address spaces that DMA requests can target. So maybe this
>> definition is not sufficient.
>>
> According to SMMU spec, the SubstreamID is equivalent to PASID. My
> understanding is that SID is equivalent to PCI requester ID that
> identifies stage 2. Do you plan to use IOASID for stage 2?
No. So actually, if PASID is not used, we still have a default single
IOASID matching the single context. So that may be fine as a definition.
> IOASID is mostly for SVA and DMA request w/ PASID.
>
>>>
>>>> +requests can target.
>>>> +
>>>> +The primary use cases for IOASID are Shared Virtual Address (SVA)
>>>> and +IO Virtual Address (IOVA). However, the requirements for
>>>> IOASID
>>>
>>> IOVA alone isn't a use case, maybe "multiple IOVA spaces per
>>> device"?
>>>> +management can vary among hardware architectures.
>>>> +
>>>> +This document covers the generic features supported by IOASID
>>>> +APIs. Vendor-specific use cases are also illustrated with Intel's
>>>> VT-d +based platforms as the first example.
>>>> +
>>>> +.. contents:: :local:
>>>> +
>>>> +Glossary
>>>> +========
>>>> +PASID - Process Address Space ID
>>>> +
>>>> +IOASID - IO Address Space ID (generic term for PCIe PASID and
>>>> +sub-stream ID in SMMU)
>>>
>>> "SubstreamID"
>>>
>>>> +
>>>> +SVA/SVM - Shared Virtual Addressing/Memory
>>>> +
>>>> +ENQCMD - New Intel X86 ISA for efficient workqueue submission
>>>> [1]
>>>
>>> Maybe drop the "New", to keep the documentation perennial. It might
>>> be good to add internal links here to the specifications URLs at
>>> the bottom.
>>>> +
>>>> +DSA - Intel Data Streaming Accelerator [2]
>>>> +
>>>> +VDCM - Virtual device composition module [3]
>>>> +
>>>> +SIOV - Intel Scalable IO Virtualization
>>>> +
>>>> +
>>>> +Key Concepts
>>>> +============
>>>> +
>>>> +IOASID Set
>>>> +-----------
>>>> +An IOASID set is a group of IOASIDs allocated from the system-wide
>>>> +IOASID pool. An IOASID set is created and can be identified by a
>>>> +token of u64. Refer to IOASID set APIs for more details.
>>>
>>> Identified either by an u64 or an mm_struct, right? Maybe just
>>> drop the second sentence if it's detailed in the IOASID set section
>>> below.
>>>> +
>>>> +IOASID set is particularly useful for guest SVA where each guest
>>>> could +have its own IOASID set for security and efficiency reasons.
>>>> +
>>>> +IOASID Set Private ID (SPID)
>>>> +----------------------------
>>>> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to
>>>> a +system-wide IOASID but the namespace of SPID is within its
>>>> IOASID +set.
>>>
>>> The intro isn't super clear. Perhaps this is simpler:
>>> "Each IOASID set has a private namespace of SPIDs. An SPID maps to a
>>> single system-wide IOASID."
>> or, "within an ioasid set, each ioasid can be associated with an alias
>> ID, named SPID."
> I don't have strong opinion, I feel it is good to explain the
> relationship between SPID and IOASID in both directions, how about add?
> " Conversely, each IOASID is associated with an alias ID, named SPID."
yep. I may suggest: each IOASID may be associated with an alias ID,
local to the IOASID set, named SPID.
>
>>>
>>>> SPIDs can be used as guest IOASIDs where each guest could do
>>>> +IOASID allocation from its own pool and map them to host physical
>>>> +IOASIDs. SPIDs are particularly useful for supporting live
>>>> migration +where decoupling guest and host physical resources are
>>>> necessary. +
>>>> +For example, two VMs can both allocate guest PASID/SPID #101 but
>>>> map to +different host PASIDs #201 and #202 respectively as shown
>>>> in the +diagram below.
>>>> +::
>>>> +
>>>> + .------------------. .------------------.
>>>> + | VM 1 | | VM 2 |
>>>> + | | | |
>>>> + |------------------| |------------------|
>>>> + | GPASID/SPID 101 | | GPASID/SPID 101 |
>>>> + '------------------' '------------------' Guest
>>>> + __________|______________________|______________________
>>>> + | | Host
>>>> + v v
>>>> + .------------------. .------------------.
>>>> + | Host IOASID 201 | | Host IOASID 202 |
>>>> + '------------------' '------------------'
>>>> + | IOASID set 1 | | IOASID set 2 |
>>>> + '------------------' '------------------'
>>>> +
>>>> +Guest PASID is treated as IOASID set private ID (SPID) within an
>>>> +IOASID set, mappings between guest and host IOASIDs are stored in
>>>> the +set for inquiry.
>>>> +
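To make the two-VM diagram above concrete: the per-set SPID namespace is just a
guest-local ID that resolves to a system-wide host IOASID. Below is a
self-contained userspace C sketch; all demo_* names are made up for
illustration, mirroring only the ioasid_find_by_spid() behavior described in
the text, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int ioasid_t;
#define INVALID_IOASID ((ioasid_t)-1)

/* One guest/host pair per slot; the real set would use an xarray. */
struct spid_map {
	ioasid_t spid;   /* guest-visible ID, local to the set */
	ioasid_t ioasid; /* system-wide host ID */
};

struct demo_set {
	struct spid_map map[8];
	size_t nr;
};

/* Stand-in for ioasid_find_by_spid(): resolve a guest SPID to a host IOASID. */
static ioasid_t demo_find_by_spid(const struct demo_set *set, ioasid_t spid)
{
	for (size_t i = 0; i < set->nr; i++)
		if (set->map[i].spid == spid)
			return set->map[i].ioasid;
	return INVALID_IOASID;
}
```

With two sets modeling the diagram, both VMs can hold SPID 101 while resolving
to different host IOASIDs (201 and 202).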
>>>> +IOASID APIs
>>>> +===========
>>>> +To get the IOASID APIs, users must #include <linux/ioasid.h>.
>>>> These APIs +serve the following functionalities:
>>>> +
>>>> + - IOASID allocation/Free
>>>> + - Group management in the form of ioasid_set
>>>> + - Private data storage and lookup
>>>> + - Reference counting
>>>> + - Event notification in case of state change
>> (a)
> got it
>
>>>> +
>>>> +IOASID Set Level APIs
>>>> +--------------------------
>>>> +For use cases such as guest SVA it is necessary to manage IOASIDs
>>>> at +a group level. For example, VMs may allocate multiple IOASIDs
>>>> for
>> I would use the introduced ioasid_set terminology instead of "group".
> Right, we already introduced it.
>
>>>> +guest process address sharing (vSVA). It is imperative to enforce
>>>> +VM-IOASID ownership such that malicious guest cannot target DMA
>>>
>>> "a malicious guest"
>>>
>>>> +traffic outside its own IOASIDs, or free an active IOASID belong
>>>> to
>>>
>>> "that belongs to"
>>>
>>>> +another VM.
>>>> +::
>>>> +
>>>> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
>>>> u32 type)
>> what is this void *token? also the type may be explained here.
> token is explained in the text following the API list. I can move it up.
>
>>>> +
>>>> + int ioasid_adjust_set(struct ioasid_set *set, int quota);
>>>
>>> These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to
>>> be consistent with the rest of the API.
>>>
>>>> +
>>>> + void ioasid_set_get(struct ioasid_set *set)
>>>> +
>>>> + void ioasid_set_put(struct ioasid_set *set)
>>>> +
>>>> + void ioasid_set_get_locked(struct ioasid_set *set)
>>>> +
>>>> + void ioasid_set_put_locked(struct ioasid_set *set)
>>>> +
>>>> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
>>>
>>> Might be nicer to keep the same argument names within the API. Here
>>> "set" rather than "sdata".
>>>
>>>> + void (*fn)(ioasid_t id, void
>>>> *data),
>>>> + void *data)
>>>
>>> (alignment)
>>>
>>>> +
>>>> +
>>>> +IOASID set concept is introduced to represent such IOASID groups.
>>>> Each
>>>
>>> Or just "IOASID sets represent such IOASID groups", but might be
>>> redundant.
>>>
>>>> +IOASID set is created with a token which can be one of the
>>>> following +types:
>> I think this explanation should happen before the above function
>> prototypes
> ditto.
>
>>>> +
>>>> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
>>>> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
>>>> +
>>>> +The explicit MM token type is useful when multiple users of an
>>>> IOASID +set under the same process need to communicate about their
>>>> shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest
>>>> can be associated +with the KVM instance for the same guest since
>>>> they share a common mm_struct. +
>>>> +The IOASID set APIs serve the following purposes:
>>>> +
>>>> + - Ownership/permission enforcement
>>>> + - Take collective actions, e.g. free an entire set
>>>> + - Event notifications within a set
>>>> + - Look up a set based on token
>>>> + - Quota enforcement
>>>
>>> This paragraph could be earlier in the section
>>
>> yes this is a kind of repetition of (a), above
> I meant to highlight what the APIs do so that readers don't
> need to read the code instead.
>
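One of the listed purposes, quota enforcement, is easy to illustrate with a
userspace sketch. The demo_* names and the -1 error returns are hypothetical
stand-ins (the kernel would use -ENOSPC and the real struct ioasid_set); only
the bookkeeping idea behind ioasid_alloc_set()/ioasid_adjust_set() is shown.

```c
#include <assert.h>

/* Minimal stand-in for struct ioasid_set: only the quota bookkeeping. */
struct demo_ioasid_set {
	int quota;      /* max IOASIDs this set may hold */
	int nr_ioasids; /* currently allocated */
};

/* Allocation succeeds only while the set is under its quota. */
static int demo_set_reserve_one(struct demo_ioasid_set *set)
{
	if (set->nr_ioasids >= set->quota)
		return -1; /* would be -ENOSPC in the kernel */
	set->nr_ioasids++;
	return 0;
}

/* Like ioasid_adjust_set(): shrinking below current usage is refused. */
static int demo_set_adjust(struct demo_ioasid_set *set, int quota)
{
	if (quota < set->nr_ioasids)
		return -1;
	set->quota = quota;
	return 0;
}
```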
>>>
>>>> +
>>>> +Individual IOASID APIs
>>>> +----------------------
>>>> +Once an ioasid_set is created, IOASIDs can be allocated from the
>>>> set. +Within the IOASID set namespace, set private ID (SPID) is
>>>> supported. In +the VM use case, SPID can be used for storing guest
>>>> PASID. +
>>>> +::
>>>> +
>>>> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>>>> ioasid_t max,
>>>> + void *private);
>>>> +
>>>> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>>>> + bool (*getter)(void *));
>>>> +
>>>> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
>>>> spid) +
>>>> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
>>>> + void *data);
>>>> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
>>>> + ioasid_t ssid);
>>>
>>> s/ssid/spid>
> got it
>
>>>> +
>>>> +
>>>> +Notifications
>>>> +-------------
>>>> +An IOASID may have multiple users, each user may have hardware
>>>> context +associated with an IOASID. When the status of an IOASID
>>>> changes, +e.g. an IOASID is being freed, users need to be notified
>>>> such that the +associated hardware context can be cleared,
>>>> flushed, and drained. +
>>>> +::
>>>> +
>>>> + int ioasid_register_notifier(struct ioasid_set *set, struct
>>>> + notifier_block *nb)
>>>> +
>>>> + void ioasid_unregister_notifier(struct ioasid_set *set,
>>>> + struct notifier_block *nb)
>>>> +
>>>> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
>>>> + notifier_block *nb)
>>>> +
>>>> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
>>>> + notifier_block *nb)
>> the mm_struct prototypes may be justified
> This is the mm type token, i.e.
> - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> I am not sure if it is better to keep the explanation in code or in
> this document, certainly don't want to duplicate.
OK. Maybe add text explaining why it makes sense to register a
notifier at mm_struct granularity.
>
>>>> +
>>>> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
>>>> + unsigned int flags)
>> this one is not obvious either.
> Here I just wanted to list the API functions, perhaps readers can check
> out the code comments?
OK never mind. The exercise is difficult anyway.
>
>>>> +
>>>> +
>>>> +Events
>>>> +~~~~~~
>>>> +Notification events are pertinent to individual IOASIDs, they can
>>>> be +one of the following:
>>>> +
>>>> + - ALLOC
>>>> + - FREE
>>>> + - BIND
>>>> + - UNBIND
>>>> +
>>>> +Ordering
>>>> +~~~~~~~~
>>>> +Ordering is supported by IOASID notification priorities as the
>>>> +following (in ascending order):
>>>> +
>>>> +::
>>>> +
>>>> + enum ioasid_notifier_prios {
>>>> + IOASID_PRIO_LAST,
>>>> + IOASID_PRIO_IOMMU,
>>>> + IOASID_PRIO_DEVICE,
>>>> + IOASID_PRIO_CPU,
>>>> + };
>>
>> Maybe:
>> when registered, notifiers are assigned a priority that affect the
>> call order. Notifiers with CPU priority get called before notifiers
>> with device priority and so on.
> Sounds good.
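To make that call-order rule concrete, here is a small self-contained userspace
sketch. The demo_* names are hypothetical, and a real notifier chain keeps its
blocks in a sorted list at registration time rather than sorting per call; the
point is only that higher-priority blocks (CPU) run before lower ones (device,
then IOMMU).

```c
#include <assert.h>
#include <stddef.h>

/* Same ascending order as the quoted ioasid_notifier_prios enum. */
enum demo_prio { PRIO_LAST, PRIO_IOMMU, PRIO_DEVICE, PRIO_CPU };

struct demo_nb {
	enum demo_prio prio;
	int id;
};

/* Compute call order: sort descending by priority (stable insertion sort),
 * then record the ids in out[] in the order they would be called. */
static void demo_call_order(struct demo_nb *nbs, size_t n, int *out)
{
	for (size_t i = 1; i < n; i++) {
		struct demo_nb key = nbs[i];
		size_t j = i;
		while (j > 0 && nbs[j - 1].prio < key.prio) {
			nbs[j] = nbs[j - 1];
			j--;
		}
		nbs[j] = key;
	}
	for (size_t i = 0; i < n; i++)
		out[i] = nbs[i].id;
}
```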
>
>>>> +
>>>> +The typical use case is when an IOASID is freed due to an
>>>> exception, DMA +source should be quiesced before tearing down
>>>> other hardware contexts +in the system. This will reduce the churn
>>>> in handling faults. DMA work +submission is performed by the CPU
>>>> which is granted higher priority than +devices.
>>>> +
>>>> +
>>>> +Scopes
>>>> +~~~~~~
>>>> +There are two types of notifiers in IOASID core: system-wide and
>>>> +ioasid_set-wide.
>>>> +
>>>> +System-wide notifier is catering for users that need to handle all
>>>> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
>>>> +
>>>> +Per ioasid_set notifier can be used by VM specific components
>>>> such as +KVM. After all, each KVM instance only cares about
>>>> IOASIDs within its +own set.
>>>> +
>>>> +
>>>> +Atomicity
>>>> +~~~~~~~~~
>>>> +IOASID notifiers are atomic due to spinlocks used inside the
>>>> IOASID +core. For tasks cannot be completed in the notifier
>>>> handler, async work
>>>
>>> "tasks that cannot be"
>>>
>>>> +can be submitted to complete the work later as long as there is no
>>>> +ordering requirement.
>>>> +
>>>> +Reference counting
>>>> +------------------
>>>> +IOASID lifecycle management is based on reference counting. Users
>>>> of +IOASID intend to align lifecycle with the IOASID need to hold
>>>
>>> "who intend to"
>>>
>>>> +reference of the IOASID. IOASID will not be returned to the pool
>>>> for
>>>
>>> "a reference to the IOASID. The IOASID"
>>>
>>>> +allocation until all references are dropped. Calling ioasid_free()
>>>> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
>>>> +reference. ioasid_get() is not allowed once an IOASID is in the
>>>> +FREE_PENDING state.
>>>> +
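The FREE_PENDING behavior described in that paragraph amounts to a small state
machine: free with outstanding references parks the ID, further gets are
refused, and the last put reclaims it. A hedged userspace sketch (demo_* names
hypothetical, -1 standing in for the kernel's -ENOENT):

```c
#include <assert.h>

enum demo_state { DEMO_ACTIVE, DEMO_FREE_PENDING, DEMO_FREED };

struct demo_ioasid {
	enum demo_state state;
	int refs;
};

/* Like ioasid_get(): refused once the ID is marked for freeing. */
static int demo_get(struct demo_ioasid *id)
{
	if (id->state != DEMO_ACTIVE)
		return -1; /* -ENOENT in the kernel */
	id->refs++;
	return 0;
}

/* Like ioasid_free(): return to the pool only when no references remain,
 * otherwise linger in FREE_PENDING until the last put. */
static void demo_free(struct demo_ioasid *id)
{
	id->state = id->refs ? DEMO_FREE_PENDING : DEMO_FREED;
}

static void demo_put(struct demo_ioasid *id)
{
	if (--id->refs == 0 && id->state == DEMO_FREE_PENDING)
		id->state = DEMO_FREED;
}
```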
>>>> +Event notifications are used to inform users of IOASID status
>>>> change. +IOASID_FREE event prompts users to drop their references
>>>> after +clearing its context.
>>>> +
>>>> +For example, on VT-d platform when an IOASID is freed, teardown
>>>> +actions are performed on KVM, device driver, and IOMMU driver.
>>>> +KVM shall register notifier block with::
>>>> +
>>>> + static struct notifier_block pasid_nb_kvm = {
>>>> + .notifier_call = pasid_status_change_kvm,
>>>> + .priority = IOASID_PRIO_CPU,
>>>> + };
>>>> +
>>>> +VDCM driver shall register notifier block with::
>>>> +
>>>> + static struct notifier_block pasid_nb_vdcm = {
>>>> + .notifier_call = pasid_status_change_vdcm,
>>>> + .priority = IOASID_PRIO_DEVICE,
>>>> + };
>> not sure those code snippets are really useful. Maybe simply say who
>> is supposed to use each prio.
> Agreed, not all the bits in the snippets are explained. I will explain
> that KVM and VDCM need to use priorities to ensure call order.
>
>>>> +
>>>> +In both cases, notifier blocks shall be registered on the IOASID
>>>> set +such that *only* events from the matching VM is received.
>>>> +
>>>> +If KVM attempts to register notifier block before the IOASID set
>>>> is +created for the MM token, the notifier block will be placed on
>>>> a
>> using the MM token
> sounds good
>
>>>> +pending list inside IOASID core. Once the token matching IOASID
>>>> set +is created, IOASID will register the notifier block
>>>> automatically.
>> Is this implementation mandated? Can't you enforce the ioasid_set to
>> be created before the notifier gets registered?
>>>> +IOASID core does not replay events for the existing IOASIDs in the
>>>> +set. For IOASID set of MM type, notification blocks can be
>>>> registered +on empty sets only. This is to avoid lost events.
>>>> +
>>>> +IOMMU driver shall register notifier block on global chain::
>>>> +
>>>> + static struct notifier_block pasid_nb_vtd = {
>>>> + .notifier_call = pasid_status_change_vtd,
>>>> + .priority = IOASID_PRIO_IOMMU,
>>>> + };
>>>> +
>>>> +Custom allocator APIs
>>>> +---------------------
>>>> +
>>>> +::
>>>> +
>>>> + int ioasid_register_allocator(struct ioasid_allocator_ops
>>>> *allocator); +
>>>> + void ioasid_unregister_allocator(struct ioasid_allocator_ops
>>>> *allocator); +
>>>> +Allocator Choices
>>>> +~~~~~~~~~~~~~~~~~
>>>> +IOASIDs are allocated for both host and guest SVA/IOVA usage.
>>>> However, +allocators can be different. For example, on VT-d guest
>>>> PASID +allocation must be performed via a virtual command
>>>> interface which is +emulated by VMM.
>>>> +
>>>> +IOASID core has the notion of "custom allocator" such that guest
>>>> can +register virtual command allocator that precedes the default
>>>> one. +
>>>> +Namespaces
>>>> +~~~~~~~~~~
>>>> +IOASIDs are limited system resources that default to 20 bits in
>>>> +size. Since each device has its own table, theoretically the
>>>> namespace +can be per device also. However, for security reasons
>>>> sharing PASID +tables among devices are not good for isolation.
>>>> Therefore, IOASID +namespace is system-wide.
>>>
>>> I don't follow this development. Having per-device PASID table
>>> would work fine for isolation (assuming no hardware bug
>>> necessitating IOMMU groups). If I remember correctly IOASID space
>>> was chosen to be OS-wide because it simplifies the management code
>>> (single PASID per task), and it is system-wide across VMs only in
>>> the case of VT-d scalable mode.
>>>> +
>>>> +There are also other reasons to have this simpler system-wide
>>>> +namespace. Take VT-d as an example, VT-d supports shared workqueue
>>>> +and ENQCMD[1] where one IOASID could be used to submit work on
>>>
>>> Maybe use the Sphinx glossary syntax rather than "[1]"
>>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive
>>>
>>>> +multiple devices that are shared with other VMs. This requires
>>>> IOASID +to be system-wide. This is also the reason why guests must
>>>> use an +emulated virtual command interface to allocate IOASID from
>>>> the host. +
>>>> +
>>>> +Life cycle
>>>> +==========
>>>> +This section covers IOASID lifecycle management for both
>>>> bare-metal +and guest usages. In bare-metal SVA, MMU notifier is
>>>> directly hooked +up with IOMMU driver, therefore the process
>>>> address space (MM) +lifecycle is aligned with IOASID.
>> therefore the IOASID lifecycle matches the process address space (MM)
>> lifecycle?
> Sounds good.
>
>>>> +
>>>> +However, guest MMU notifier is not available to host IOMMU
>>>> driver,
>> the guest MMU notifier
>>>> +when guest MM terminates unexpectedly, the events have to go
>>>> through
>> the guest MM
>>>> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also
>>>> more +parties involved in guest SVA, e.g. on Intel VT-d platform,
>>>> IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO.
>>>> +
>>>> +Native IOASID Life Cycle (VT-d Example)
>>>> +---------------------------------------
>>>> +
>>>> +The normal flow of native SVA code with Intel Data Streaming
>>>> +Accelerator(DSA) [2] as example:
>>>> +
>>>> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
>>>> +2. DSA driver allocate WQ, do sva_bind_device();
>>>> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
>>>> + mmu_notifier_get()
>>>> +4. DMA starts by DSA driver userspace
>>>> +5. DSA userspace close FD
>>>> +6. DSA/uacce kernel driver handles FD.close()
>>>> +7. DSA driver stops DMA
>>>> +8. DSA driver calls sva_unbind_device();
>>>> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
>>>> + TLBs. mmu_notifier_put() called.
>>>> +10. mmu_notifier.release() called, IOMMU SVA code calls
>>>> ioasid_free()* +11. The IOASID is returned to the pool, reclaimed.
>>>> +
>>>> +::
>>>> +
>>>
>>> Use a footnote?
>>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes
>>>> + * With ENQCMD, PASID used on VT-d is not released in
>>>> mmu_notifier() but
>>>> + mmdrop(). mmdrop comes after FD close. Should not matter.
>>>
>>> "comes after FD close, which doesn't make a difference?"
>>> The following might not be necessary since early process
>>> termination is described later.
>>>
>>>> + If the user process dies unexpectedly, Step #10 may come
>>>> before
>>>> + Step #5, in between, all DMA faults discarded. PRQ responded
>>>> with
>>>
>>> PRQ hasn't been defined in this document.
>>>
>>>> + code INVALID REQUEST.
>>>> +
>>>> +During the normal teardown, the following three steps would
>>>> happen in +order:
>> can't this be illustrated in the above 1-11 sequence, just adding
>> NORMAL TEARDOWN before #7?
>>>> +
>>>> +1. Device driver stops DMA request
>>>> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain
>>>> in-flight
>>>> + requests.
>>>> +3. IOASID freed
>>>> +
>> Then you can just focus on abnormal termination
> Yes, will refer to the steps starting #7. These can be removed.
>
>>>> +Exception happens when process terminates *before* device driver
>>>> stops +DMA and call IOMMU driver to unbind. The flow of process
>>>> exists are as
>> Can't this be explained with something simpler looking at the steps
>> 1-11?
> It is meant to be educational given this level of detail. Simpler
> steps are labeled with (1) (2) (3). Perhaps these labels didn't stand
> out right? I will use the steps in the 1-11 sequence.
>
>>>
>>> "exits"
>>>
>>>> +follows:
>>>> +
>>>> +::
>>>> +
>>>> + do_exit() {
>>>> + exit_mm() {
>>>> + mm_put();
>>>> + exit_mmap() {
>>>> + intel_invalidate_range() //mmu notifier
>>>> + tlb_finish_mmu()
>>>> + mmu_notifier_release(mm) {
>>>> + intel_iommu_release() {
>>>> + [2]
>>>> intel_iommu_teardown_pasid();
>>>
>>> Parentheses might be better than square brackets for step numbers
>>>
>>>> + intel_iommu_flush_tlbs();
>>>> + }
>>>> + // tlb_invalidate_range cb removed
>>>> + }
>>>> + unmap_vmas();
>>>> + free_pgtables(); // IOMMU cannot walk PGT
>>>> after this
>>>> + };
>>>> + }
>>>> + exit_files(tsk) {
>>>> + close_files() {
>>>> + dsa_close();
>>>> + [1] dsa_stop_dma();
>>>> + intel_svm_unbind_pasid(); //nothing to do
>>>> + }
>>>> + }
>>>> + }
>>>> +
>>>> + mmdrop() /* some random time later, lazy mm user */ {
>>>> + mm_free_pgd();
>>>> + destroy_context(mm); {
>>>> + [3] ioasid_free();
>>>> + }
>>>> + }
>>>> +
>>>> +As shown in the list above, step #2 could happen before
>>>> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
>>>> +
>>>> +Also notice that TLB invalidation occurs at mmu_notifier
>>>> +invalidate_range callback as well as the release callback. The
>>>> reason +is that release callback will delete IOMMU driver from the
>>>> notifier +chain which may skip invalidate_range() calls during the
>>>> exit path. +
>>>> +To avoid unnecessary reporting of UR fault, IOMMU driver shall
>>>> disable
>> UR?
> Unrecoverable, mentioned in the previous paragraph.
>
>>>> +fault reporting after free and before unbind.
>>>> +
>>>> +Guest IOASID Life Cycle (VT-d Example)
>>>> +--------------------------------------
>>>> +Guest IOASID life cycle starts with guest driver open(), this
>>>> could be +uacce or individual accelerator driver such as DSA. At
>>>> FD open, +sva_bind_device() is called which triggers a series of
>>>> actions. +
>>>> +The example below is an illustration of *normal* operations that
>>>> +involves *all* the SW components in VT-d. The flow can be simpler
>>>> if +no ENQCMD is supported.
>>>> +
>>>> +::
>>>> +
>>>> + VFIO IOMMU KVM VDCM IOASID
>>>> Ref
>>>> + ..................................................................
>>>> + 1 ioasid_register_notifier/_mm()
>>>> + 2 ioasid_alloc()
>>>> 1
>>>> + 3 bind_gpasid()
>>>> + 4 iommu_bind()->ioasid_get()
>>>> 2
>>>> + 5 ioasid_notify(BIND)
>>>> + 6 -> ioasid_get()
>>>> 3
>>>> + 7 -> vmcs_update_atomic()
>>>> + 8 mdev_write(gpasid)
>>>> + 9 hpasid=
>>>> + 10 find_by_spid(gpasid)
>>>> 4
>>>> + 11 vdev_write(hpasid)
>>>> + 12 -------- GUEST STARTS DMA --------------------------
>>>> + 13 -------- GUEST STOPS DMA --------------------------
>>>> + 14 mdev_clear(gpasid)
>>>> + 15 vdev_clear(hpasid)
>>>> + 16
>>>> ioasid_put() 3
>>>> + 17 unbind_gpasid()
>>>> + 18 iommu_ubind()
>>>> + 19 ioasid_notify(UNBIND)
>>>> + 20 -> vmcs_update_atomic()
>>>> + 21 ->
>>>> ioasid_put() 2
>>>> + 22
>>>> ioasid_free() 1
>>>> + 23
>>>> ioasid_put() 0
>>>> + 24 Reclaimed
>>>> + -------------- New Life Cycle Begin
>>>> ----------------------------
>>>> + 1 ioasid_alloc()
>>>> -> 1 +
>>>> + Note: IOASID Notification Events: FREE, BIND, UNBIND
>>>> +
>>>> +Exception cases arise when a guest crashes or a malicious guest
>>>> +attempts to cause disruption on the host system. The fault
>>>> handling +rules are:
>>>> +
>>>> +1. IOASID free must *always* succeed.
>>>> +2. An inactive period may be required before the freed IOASID is
>>>> + reclaimed. During this period, consumers of IOASID perform
>>>> cleanup. +3. Malfunction is limited to the guest owned resources
>>>> for all
>>>> + programming errors.
>>>> +
>>>> +The primary source of exception is when the following are out of
>>>> +order:
>>>> +
>>>> +1. Start/Stop of DMA activity
>>>> + (Guest device driver, mdev via VFIO)
>> please explain the meaning of what is inside (): initiator?
>>>> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
>>>> + (Host IOMMU driver bind/unbind)
>>>> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
>>>> + case of ENQCMD
>>>> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
>>>> +5. IOASID alloc/free (Host IOASID)
>>>> +
>>>> +VFIO is the *only* user-kernel interface, which is ultimately
>>>> +responsible for exception handlings.
>>>
>>> "handling"
>>>
>>>> +
>>>> +#1 is processed the same way as the assigned device today based on
>>>> +device file descriptors and events. There is no special handling.
>>>> +
>>>> +#3 is based on bind/unbind events emitted by #2.
>>>> +
>>>> +#4 is naturally aligned with IOASID life cycle in that an illegal
>>>> +guest PASID programming would fail in obtaining reference of the
>>>> +matching host IOASID.
>>>> +
>>>> +#5 is similar to #4. The fault will be reported to the user if
>>>> PASID +used in the ENQCMD is not set up in VMCS PASID translation
>>>> table. +
>>>> +Therefore, the remaining out of order problem is between #2 and
>>>> +#5. I.e. unbind vs. free. More specifically, free before unbind.
>>>> +
>>>> +IOASID notifier and refcounting are used to ensure order.
>>>> Following +a publisher-subscriber pattern where:
>> with the following actors:
>>>> +
>>>> +- Publishers: VFIO & IOMMU
>>>> +- Subscribers: KVM, VDCM, IOMMU
>> this may be introduced before.
>>>> +
>>>> +IOASID notifier is atomic which requires subscribers to do quick
>>>> +handling of the event in the atomic context. Workqueue can be
>>>> used for +any processing that requires thread context.
>> repetition of what was said before.
>> IOASID reference must be
> Right, will remove.
>
>>>> +acquired before receiving the FREE event. The reference must be
>>>> +dropped at the end of the processing in order to return the
>>>> IOASID to +the pool.
>>>> +
>>>> +Let's examine the IOASID life cycle again when free happens
>>>> *before* +unbind. This could be a result of misbehaving guests or
>>>> crash. Assuming +VFIO cannot enforce unbind->free order. Notice
>>>> that the setup part up +until step #12 is identical to the normal
>>>> case, the flow below starts +with step 13.
>>>> +
>>>> +::
>>>> +
>>>> + VFIO IOMMU KVM VDCM IOASID
>>>> Ref
>>>> + ..................................................................
>>>> + 13 -------- GUEST STARTS DMA --------------------------
>>>> + 14 -------- *GUEST MISBEHAVES!!!* ----------------
>>>> + 15 ioasid_free()
>>>> + 16
>>>> ioasid_notify(FREE)
>>>> + 17
>>>> mark_ioasid_inactive[1]
>>>> + 18 kvm_nb_handler(FREE)
>>>> + 19 vmcs_update_atomic()
>>>> + 20 ioasid_put_locked() ->
>>>> 3
>>>> + 21 vdcm_nb_handler(FREE)
>>>> + 22 iomm_nb_handler(FREE)
>>>> + 23 ioasid_free() returns[2] schedule_work()
>>>> 2
>>>> + 24 schedule_work() vdev_clear_wk(hpasid)
>>>> + 25 teardown_pasid_wk()
>>>> + 26 ioasid_put() ->
>>>> 1
>>>> + 27 ioasid_put()
>>>> 0
>>>> + 28 Reclaimed
>>>> + 29 unbind_gpasid()
>>>> + 30 iommu_unbind()->ioasid_find() Fails[3]
>>>> + -------------- New Life Cycle Begin
>>>> ---------------------------- +
>>>> +Note:
>>>> +
>>>> +1. By marking IOASID inactive at step #17, no new references can
>>>> be
>>>
>>> Is "inactive" FREE_PENDING?
>>>
>>>> + held. ioasid_get/find() will return -ENOENT;
>>>> +2. After step #23, all events can go out of order. Shall not
>>>> affect
>>>> + the outcome.
>>>> +3. IOMMU driver fails to find private data for unbinding. If
>>>> unbind is
>>>> + called after the same IOASID is allocated for the same guest
>>>> again,
>>>> + this is a programming error. The damage is limited to the guest
>>>> + itself since unbind performs permission checking based on the
>>>> + IOASID set associated with the guest process.
>>>> +
>>>> +KVM PASID Translation Table Updates
>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> +Per VM PASID translation table is maintained by KVM in order to
>>>> +support ENQCMD in the guest. The table contains host-guest PASID
>>>> +translations to be consumed by CPU ucode. The synchronization of
>>>> the +PASID states depends on VFIO/IOMMU driver, where IOCTL and
>>>> atomic +notifiers are used. KVM must register IOASID notifier per
>>>> VM instance +during launch time. The following events are handled:
>>>> +
>>>> +1. BIND/UNBIND
>>>> +2. FREE
>>>> +
>>>> +Rules:
>>>> +
>>>> +1. Multiple devices can bind with the same PASID, this can be
>>>> different PCI
>>>> + devices or mdevs within the same PCI device. However, only the
>>>> + *first* BIND and *last* UNBIND emit notifications.
>>>> +2. IOASID code is responsible for ensuring the correctness of H-G
>>>> + PASID mapping. There is no need for KVM to validate the
>>>> + notification data.
>>>> +3. When UNBIND happens *after* FREE, KVM will see error in
>>>> + ioasid_get() even when the reclaim is not done. IOMMU driver
>>>> will
>>>> + also avoid sending UNBIND if the PASID is already FREE.
>>>> +4. When KVM terminates *before* FREE & UNBIND, references will be
>>>> + dropped for all host PASIDs.
>>>> +
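Rule #1 above (notify only on the *first* BIND and *last* UNBIND of a PASID)
reduces to per-PASID counting of bound devices. A minimal userspace sketch,
with hypothetical demo_* names rather than the kernel's actual bookkeeping:

```c
#include <assert.h>

/* Per-PASID bind bookkeeping: multiple devices (PCI devices or mdevs)
 * may bind the same PASID. */
struct demo_pasid_binding {
	int nr_devs;
};

/* Returns 1 when a BIND notification should be emitted. */
static int demo_bind(struct demo_pasid_binding *b)
{
	return ++b->nr_devs == 1;
}

/* Returns 1 when an UNBIND notification should be emitted. */
static int demo_unbind(struct demo_pasid_binding *b)
{
	return --b->nr_devs == 0;
}
```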
>>>> +VDCM PASID Programming
>>>> +~~~~~~~~~~~~~~~~~~~~~~
>>>> +VDCM composes virtual devices and exposes them to the guests. When
>>>> +the guest allocates a PASID then program it to the virtual
>>>> device, VDCM
>> programs as well
>>>> +intercepts the programming attempt then program the matching
>>>> host
>>>
>>> "programs"
>>>
>>> Thanks,
>>> Jean
>>>
>>>> +PASID on to the hardware.
>>>> +Conversely, when a device is going away, VDCM must be informed
>>>> such +that PASID context on the hardware can be cleared. There
>>>> could be +multiple mdevs assigned to different guests in the same
>>>> VDCM. Since +the PASID table is shared at PCI device level, lazy
>>>> clearing is not +secure. A malicious guest can attack by using
>>>> newly freed PASIDs that +are allocated by another guest.
>>>> +
>>>> +By holding a reference of the PASID until VDCM cleans up the HW
>>>> context, +it is guaranteed that PASID life cycles do not cross
>>>> within the same +device.
>>>> +
>>>> +
>>>> +Reference
>>>> +====================================================
>>>> +1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
>>>> +
>>>> +2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>>>> +
>>>> +3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
>>>> --
>>>> 2.7.4
>>
>> Thanks
>>
>> Eric
>>>>
>>>
>>
> [Jacob Pan]
>
Thanks

Eric