RE: [PATCH v3 1/5] docs: IOMMU user API
From: Tian, Kevin
Date: Mon Jun 29 2020 - 22:52:50 EST
> From: Jacob Pan
> Sent: Tuesday, June 30, 2020 7:05 AM
>
> On Fri, 26 Jun 2020 16:19:23 -0600
> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
>
> > On Tue, 23 Jun 2020 10:03:53 -0700
> > Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx> wrote:
> >
> > > IOMMU UAPI is newly introduced to support communications between
> > > guest virtual IOMMU and host IOMMU. There has been lots of
> > > discussions on how it should work with VFIO UAPI and userspace in
> > > general.
> > >
> > > This document is indended to clarify the UAPI design and usage. The
> > > mechenics of how future extensions should be achieved are also
> > > covered in this documentation.
> > >
> > > Signed-off-by: Liu Yi L <yi.l.liu@xxxxxxxxx>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
> > > ---
> > > Documentation/userspace-api/iommu.rst | 244
> > > ++++++++++++++++++++++++++++++++++ 1 file changed, 244 insertions(+)
> > > create mode 100644 Documentation/userspace-api/iommu.rst
> > >
> > > diff --git a/Documentation/userspace-api/iommu.rst
> > > b/Documentation/userspace-api/iommu.rst new file mode 100644
> > > index 000000000000..f9e4ed90a413
> > > --- /dev/null
> > > +++ b/Documentation/userspace-api/iommu.rst
> > > @@ -0,0 +1,244 @@
> > > +.. SPDX-License-Identifier: GPL-2.0
> > > +.. iommu:
> > > +
> > > +=====================================
> > > +IOMMU Userspace API
> > > +=====================================
> > > +
> > > +IOMMU UAPI is used for virtualization cases where communications
> > > are +needed between physical and virtual IOMMU drivers. For native
> > > +usage, IOMMU is a system device which does not need to communicate
> > > +with user space directly.
> > > +
> > > +The primary use cases are guest Shared Virtual Address (SVA) and
> > > +guest IO virtual address (IOVA), wherein a virtual IOMMU (vIOMMU)
> > > is +required to communicate with the physical IOMMU in the host.
> > > +
> > > +.. contents:: :local:
> > > +
> > > +Functionalities
> > > +===============
> > > +Communications of user and kernel involve both directions. The
> > > +supported user-kernel APIs are as follows:
> > > +
> > > +1. Alloc/Free PASID
> > > +2. Bind/unbind guest PASID (e.g. Intel VT-d)
> > > +3. Bind/unbind guest PASID table (e.g. ARM sMMU)
> > > +4. Invalidate IOMMU caches
> > > +5. Service page requests
> > > +
> > > +Requirements
> > > +============
> > > +The IOMMU UAPIs are generic and extensible to meet the following
> > > +requirements:
> > > +
> > > +1. Emulated and para-virtualised vIOMMUs
> > > +2. Multiple vendors (Intel VT-d, ARM sMMU, etc.)
> > > +3. Extensions to the UAPI shall not break existing user space
> > > +
> > > +Interfaces
> > > +==========
> > > +Although the data structures defined in IOMMU UAPI are
> > > self-contained, +there is no user API functions introduced.
> > > Instead, IOMMU UAPI is +designed to work with existing user driver
> > > frameworks such as VFIO. +
> > > +Extension Rules & Precautions
> > > +-----------------------------
> > > +When IOMMU UAPI gets extended, the data structures can *only* be
> > > +modified in two ways:
> > > +
> > > +1. Adding new fields by re-purposing the padding[] field. No size
> > > change. +2. Adding new union members at the end. May increase in
> > > size. +
> > > +No new fields can be added *after* the variable sized union in
> > > that it +will break backward compatibility when offset moves. In
> > > both cases, a +new flag must be accompanied with a new field such
> > > that the IOMMU +driver can process the data based on the new flag.
> > > Version field is +only reserved for the unlikely event of UAPI
> > > upgrade at its entirety. +
> > > +It's *always* the caller's responsibility to indicate the size of
> > > the +structure passed by setting argsz appropriately.
> > > +Though at the same time, argsz is user provided data which is not
> > > +trusted. The argsz field allows the user to indicate how much data
> > > +they're providing, it's still the kernel's responsibility to
> > > validate +whether it's correct and sufficient for the requested
> > > operation. +
> > > +Compatibility Checking
> > > +----------------------
> > > +When IOMMU UAPI extension results in size increase, user such as
> > > VFIO +has to handle the following cases:
> > > +
> > > +1. User and kernel has exact size match
> > > +2. An older user with older kernel header (smaller UAPI size)
> > > running on a
> > > + newer kernel (larger UAPI size)
> > > +3. A newer user with newer kernel header (larger UAPI size) running
> > > + on an older kernel.
> > > +4. A malicious/misbehaving user pass illegal/invalid size but
> > > within
> > > + range. The data may contain garbage.
> >
> > What exactly does vfio need to do to handle these?
> >
> VFIO does nothing other than returning the status from IOMMU driver.
> Based on the return status, users such as QEMU can cause fault
> conditions within the vIOMMU.
But from above description, "user such as VFIO has to handle the
following cases"...
Thanks
Kevin
>
> > > +
> > > +Feature Checking
> > > +----------------
> > > +While launching a guest with vIOMMU, it is important to ensure
> > > that host +can support the UAPI data structures to be used for
> > > vIOMMU-pIOMMU +communications. Without upfront compatibility
> > > checking, future faults +are difficult to report even in normal
> > > conditions. For example, TLB +invalidations should always succeed.
> > > There is no architectural way to +report back to the vIOMMU if the
> > > UAPI data is incompatible. If that +happens, in order to protect
> > > IOMMU iosolation guarantee, we have to +resort to not giving
> > > completion status in vIOMMU. This may result in +VM hang.
> > > +
> > > +For this reason the following IOMMU UAPIs cannot fail:
> > > +
> > > +1. Free PASID
> > > +2. Unbind guest PASID
> > > +3. Unbind guest PASID table (SMMU)
> > > +4. Cache invalidate
> > > +
> > > +User applications such as QEMU is expected to import kernel UAPI
> > > +headers. Backward compatibility is supported per feature flags.
> > > +For example, an older QEMU (with older kernel header) can run on
> > > newer +kernel. Newer QEMU (with new kernel header) may refuse to
> > > initialize +on an older kernel if new feature flags are not
> > > supported by older +kernel. Simply recompile existing code with
> > > newer kernel header should +not be an issue in that only existing
> > > flags are used. +
> > > +IOMMU vendor driver should report the below features to IOMMU UAPI
> > > +consumers (e.g. via VFIO).
> > > +
> > > +1. IOMMU_NESTING_FEAT_SYSWIDE_PASID
> > > +2. IOMMU_NESTING_FEAT_BIND_PGTBL
> > > +3. IOMMU_NESTING_FEAT_BIND_PASID_TABLE
> > > +4. IOMMU_NESTING_FEAT_CACHE_INVLD
> > > +5. IOMMU_NESTING_FEAT_PAGE_REQUEST
> > > +
> > > +Take VFIO as example, upon request from VFIO user space (e.g.
> > > QEMU), +VFIO kernel code shall query IOMMU vendor driver for the
> > > support of +the above features. Query result can then be reported
> > > back to the +user-space caller. Details can be found in
> > > +Documentation/driver-api/vfio.rst.
> > > +
> > > +
> > > +Data Passing Example with VFIO
> > > +------------------------------
> > > +As the ubiquitous userspace driver framework, VFIO is already IOMMU
> > > +aware and share many key concepts such as device model, group, and
> > > +protection domain. Other user driver frameworks can also be
> > > extended +to support IOMMU UAPI but it is outside the scope of this
> > > document. +
> > > +In this tight-knit VFIO-IOMMU interface, the ultimate consumer of
> > > the +IOMMU UAPI data is the host IOMMU driver. VFIO facilitates
> > > user-kernel +transport, capability checking, security, and life
> > > cycle management of +process address space ID (PASID).
> > > +
> > > +Unlike normal user data passed via VFIO UAPI IOTCL, IOMMU driver
> > > is the +ultimate consumer of its UAPI data. At VFIO layer, the
> > > IOMMU UAPI data +is wrapped in a VFIO UAPI data. It follows the
> > > +pattern below::
> > > +
> > > + struct {
> > > + __u32 argsz;
> > > + __u32 flags;
> > > + __u8 data[];
> > > + };
> > > +
> > > +Here data[] contains the IOMMU UAPI data structures. VFIO has the
> > > +freedom to bundle the data as well as parse data size based on its
> > > own flags. +
> > > +In order to determine the size and feature set of the user data,
> > > argsz +and flags are also embedded in the IOMMU UAPI data
> > > structures. +A "__u32 argsz" field is *always* at the beginning of
> > > each structure. +
> > > +For example:
> > > +::
> > > +
> > > + struct iommu_cache_invalidate_info {
> > > + __u32 argsz;
> > > + #define IOMMU_CACHE_INVALIDATE_INFO_VERSION_1 1
> > > + __u32 version;
> > > + /* IOMMU paging structure cache */
> > > + #define IOMMU_CACHE_INV_TYPE_IOTLB (1 << 0) /*
> > > IOMMU IOTLB */
> > > + #define IOMMU_CACHE_INV_TYPE_DEV_IOTLB (1 << 1) /*
> > > Device IOTLB */
> > > + #define IOMMU_CACHE_INV_TYPE_PASID (1 << 2) /*
> > > PASID cache */
> > > + #define IOMMU_CACHE_INV_TYPE_NR (3)
> > > + __u8 cache;
> > > + __u8 granularity;
> > > + __u8 padding[2];
> > > + union {
> > > + struct iommu_inv_pasid_info pasid_info;
> > > + struct iommu_inv_addr_info addr_info;
> > > + } granu;
> > > + };
> > > +
> > > +VFIO is responsible for checking its own argsz and flags then
> > > invokes +appropriate IOMMU UAPI functions. User pointer is passed
> > > to IOMMU +layer for further processing. The responsibilities are
> > > divided as +follows:
> > > +
> > > +- Generic IOMMU layer checks argsz range and override out-of-range
> > > + value. If the exact argsz is based on generic flags, they are
> > > checked
> > > + here as well.
> > > +
> > > +- Vendor IOMMU driver checks argsz based on vendor flags, UAPI data
> > > + is consumed based on flags
> > > +
> > > +Once again, use guest TLB invalidation as an example, argsz is
> > > based +on generic flags in the invalidation information. IOMMU
> > > generic code +shall process the UAPI data as the following:
> > > +
> > > +::
> > > +
> > > + int iommu_cache_invalidate(struct iommu_domain *domain, struct
> > > device *dev,
> > > + void __user *uinfo)
> > > + {
> > > + /* Current kernel data size is the max to be copied from
> > > user */
> > > + maxsz = sizeof(struct iommu_cache_invalidate_info);
> > > + memset((void *)&inv_info, 0, maxsz);
> > > +
> > > + /*
> > > + * No new spaces can be added before the variable sized
> > > union, the
> > > + * minimum size is the offset to the union.
> > > + */
> > > + minsz = offsetof(struct iommu_cache_invalidate_info,
> > > granu); +
> > > + /* Copy minsz from user to get flags and argsz */
> > > + if (copy_from_user(&inv_info, uinfo, minsz))
> > > + return -EFAULT;
> > > +
> > > + /* Fields before variable size union is mandatory */
> > > + if (inv_info.argsz < minsz)
> > > + return -EINVAL;
> > > + /*
> > > + * User might be using a newer UAPI header which has a
> > > larger data
> > > + * size, we shall support the existing flags within the
> > > current
> > > + * size.
> > > + */
> > > + if (inv_info.argsz > maxsz)
> > > + inv_info.argsz = maxsz;
> > > +
> > > + /* Checking the exact argsz based on generic flags */
> > > + if (inv_info.granularity == IOMMU_INV_GRANU_ADDR &&
> > > + inv_info.argsz != offsetofend(struct
> > > iommu_cache_invalidate_info,
> > > + granu.addr_info))
> >
> > Is it really reasonable to expect the user to specify argsz to the
> > exact union element for the callback? I'd certainly expect users to
> > simply use sizeof(struct iommu_cache_invalidate_info) and it should
> > therefore be sufficient to test >= here rather than jump through hoops
> > with an exact size. We're already changing inv_info.argsz above to
> > fit our known structure, it's inconsistent to then expect it to be
> > some exact value.
> >
> I was thinking argsz doesn't have to be the exact struct size. It should
> be whatever the sufficient & correct size used by the user for a given
> call.
>
> For example, current struct iommu_gpasid_bind_data {} only has VT-d
> data. If it gets extended with SMMU data in the union, VT-d vIOMMU
> emulation should only fill the union size of vt-d.
>
> > > + return -EINVAL;
> > > +
> > > + if (inv_info.granularity == IOMMU_INV_GRANU_PASID &&
> > > + inv_info.argsz != offsetofend(struct
> > > iommu_cache_invalidate_info,
> > > + granu.pasid_info))
> > > + return -EINVAL;
> > > +
> > > + /* Copy the remaining user data _after_ minsz */
> > > + if (copy_from_user((void *)&inv_info + minsz, uinfo +
> > > minsz,
> > > + inv_info.argsz - minsz))
> > > + return -EFAULT;
> > > +
> > > + return domain->ops->cache_invalidate(domain, dev,
> > > &inv_info);
> > > + }
> > > + Add a wrapper
> > > + __iommu_unbind_( kernel data, same user data, kernel copy)
> > > +
> This should be removed. Sorry about the confusion. The patch does not
> have two data pointers, just separate APIs for kernel and user.
>
> > > +Notice that in this example, since union size is determined by
> > > generic +flags, all checking to argsz is validated in the generic
> > > IOMMU layer, +vendor driver does not need to check argsz. However,
> > > if union size is +based on vendor data, such as
> > > iommu_sva_bind_gpasid(), it will be +vendor driver's responsibility
> > > to validate the exact argsz.
> >
> > struct iommu_cache_invalidate_info is a good example because it
> > explicitly states a table of type vs granularity validity. When the
> > cache_invalidate() callback is used by an internal user we can
> > consider it a bug in the caller if its usage falls outside of these
> > prescribed valid combinations, ie. iommu_ops callbacks may assume a
> > trusted caller that isn't trying to exploit any loophole.
> Separate APIs are proposed in the patchset to address UAPIs
> with both kernel and user callers. Sorry about the last line in the
> example above. Currently, only unbind_gpasid() and page_response() have
> both kernel and userspace callers. e.g.
>
> /* userspace caller */
> int iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct
> device *dev,
> void __user *udata)
>
> /* in-kernel caller */
> int __iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct
> device *dev,
> struct iommu_gpasid_bind_data *data)
>
> We don;t expect in-kernel caller for cache invalidate in that it is
> implied in unmap, unbind operations.
>
> > But here
> > we've done nothing more than validated the supplied size to pass it
> > through to a non-user hardened callback. We didn't check the
> > version,
> Yes, I should move up the version check from vendor driver.
>
> > we didn't check that any of the undefined bits in cache or
> > granularity or padding were set, we don't know what flags might be
> > set in the union elements.
> You are right, we should sanitize reserved bits.
>
> > For example, if a user is able to set a
> > flag that gets ignored now, that means we can never use that flag
> > without potentially breaking that user in the future.
> Good point, all reserved/unused bits should be tested.
>
> > If a user can
> > pass in version 3141592654 now, then we can never use version for
> > validation. I see that intel_iommu_sva_invalidate() does test the
> > version, but has no obvious other hardening. I'm afraid we're being
> > far to lax about accepting a data structure provided by a user, we
> > should not assume good faith. Thanks,
> >
> Agreed. will add checks in the IOMMU generic layer for reserved
> bits.
> For VT-d vendor driver, we do check all bits in cache types, i.e. in
> intel/iommu.c
> for_each_set_bit(cache_type,
> (unsigned long *)&inv_info->cache,
> IOMMU_CACHE_INV_TYPE_NR) {
>
>
> one other hardening is to check vendor argsz. This is in the
> bind_gpasid call.
>
> if (data->argsz != offsetofend(struct iommu_gpasid_bind_data,
> vendor.vtd))
> return -EINVAL;
>
>
>
> > Alex
> >
>
> [Jacob Pan]
>
>
>
> _______________________________________________
> iommu mailing list
> iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx
> https://lists.linuxfoundation.org/mailman/listinfo/iommu