Re: [PATCH v2 1/3] iommu/uapi: Define uapi version and capabilities

From: Jacob Pan
Date: Tue Apr 14 2020 - 13:07:20 EST


On Tue, 14 Apr 2020 10:13:58 -0600
Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:

> On Mon, 13 Apr 2020 22:05:15 -0700
> Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx> wrote:
>
> > Hi Alex,
> > Thanks a lot for the feedback, my comments inline.
> >
> > On Mon, 13 Apr 2020 16:21:29 -0600
> > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> >
> > > On Mon, 13 Apr 2020 13:41:57 -0700
> > > Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Just a gentle reminder, any feedback on the options I listed
> > > > below? New ideas will be even better.
> > > >
> > > > Christoph, does the explanation make sense to you? We do have
> > > > the capability/flag based scheme for IOMMU API extension, the
> > > > version is mainly used for size lookup. Compatibility checking
> > > > is another use of the version, it makes checking easy when a
> > > > vIOMMU is launched.
> > > >
> > > > Thanks,
> > > >
> > > > Jacob
> > > >
> > > > On Thu, 2 Apr 2020 11:36:04 -0700
> > > > Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > > On Wed, 1 Apr 2020 05:32:21 +0000
> > > > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > > >
> > > > > > > From: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
> > > > > > > Sent: Tuesday, March 31, 2020 11:55 PM
> > > > > > >
> > > > > > > On Tue, 31 Mar 2020 06:06:38 +0000
> > > > > > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > > > From: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
> > > > > > > > > Sent: Tuesday, March 31, 2020 12:08 AM
> > > > > > > > >
> > > > > > > > > On Mon, 30 Mar 2020 05:40:40 +0000
> > > > > > > > > "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > > > From: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
> > > > > > > > > > > Sent: Saturday, March 28, 2020 7:54 AM
> > > > > > > > > > >
> > > > > > > > > > > On Fri, 27 Mar 2020 00:47:02 -0700
> > > > > > > > > > > Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Fri, Mar 27, 2020 at 02:49:55AM +0000, Tian,
> > > > > > > > > > > > Kevin wrote:
> > > > > > > > > > > > > If those API calls are inter-dependent for
> > > > > > > > > > > > > composing a feature (e.g. SVA), shouldn't we
> > > > > > > > > > > > > need a way to check them together before
> > > > > > > > > > > > > exposing the feature to the guest, e.g.
> > > > > > > > > > > > > through a iommu_get_uapi_capabilities
> > > > > > > > > > > > > interface?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, that makes sense. The important bit is to
> > > > > > > > > > > > have a capability flags and not version
> > > > > > > > > > > > numbers.
> > > > > > > > > > >
> > > > > > > > > > > The challenge is that there are two consumers in
> > > > > > > > > > > the kernel for this. 1. VFIO only look for
> > > > > > > > > > > compatibility, and size of each data struct such
> > > > > > > > > > > that it can copy_from_user.
> > > > > > > > > > >
> > > > > > > > > > > 2. IOMMU driver, the "real consumer" of the
> > > > > > > > > > > content.
> > > > > > > > > > >
> > > > > > > > > > > For 2, I agree and we do plan to use the
> > > > > > > > > > > capability flags to check content and maintain
> > > > > > > > > > > backward compatibility etc.
> > > > > > > > > > >
> > > > > > > > > > > For VFIO, it is difficult to do size look up
> > > > > > > > > > > based on capability flags.
> > > > > > > > > >
> > > > > > > > > > Can you elaborate the difficulty in VFIO? if, as
> > > > > > > > > > Christoph Hellwig pointed out, version number is
> > > > > > > > > > already avoided everywhere, it is interesting to
> > > > > > > > > > know whether this work becomes a real exception or
> > > > > > > > > > just requires a different mindset.
> > > > > > > > > From VFIO p.o.v. the IOMMU UAPI data is opaque, it
> > > > > > > > > only needs to do two things:
> > > > > > > > > 1. is the UAPI compatible?
> > > > > > > > > 2. what is the size to copy?
> > > > > > > > >
> > > > > > > > > If you look at the version number, this is really a
> > > > > > > > > "version as size" lookup, as provided by the helper
> > > > > > > > > function in this patch. An example can be the newly
> > > > > > > > > introduced clone3 syscall.
> > > > > > > > > https://lwn.net/Articles/792628/ In clone3, new
> > > > > > > > > version must have new size. The slight difference
> > > > > > > > > here is that, unlike clone3, we have multiple data
> > > > > > > > > structures instead of a single struct clone_args {}.
> > > > > > > > > And each struct has flags to enumerate its contents
> > > > > > > > > besides size.
> > > > > > > >
> > > > > > > > Thanks for providing that link. However clone3 doesn't
> > > > > > > > include a version field to do "version as size" lookup.
> > > > > > > > Instead, as you said, it includes a size parameter which
> > > > > > > > sounds like the option 3 (argsz) listed below.
> > > > > > > >
> > > > > > > Right, there is no version in clone3. size = version. I
> > > > > > > view this as a 1:1 lookup.
> > > > > > >
> > > > > > > > >
> > > > > > > > > Besides breaching data abstraction, if VFIO has to
> > > > > > > > > check IOMMU flags to determine the sizes, it has many
> > > > > > > > > combinations.
> > > > > > > > >
> > > > > > > > > We also separate the responsibilities into two parts
> > > > > > > > > 1. compatibility - version, size by VFIO
> > > > > > > > > 2. sanity check - capability flags - by
> > > > > > > > > IOMMU
> > > > > > > >
> > > > > > > > I feel argsz+flags approach can perfectly meet above
> > > > > > > > requirement. The userspace set the size and flags for
> > > > > > > > whatever capabilities it uses, and VFIO simply copies
> > > > > > > > the parameters by size and pass to IOMMU for further
> > > > > > > > sanity check. Of course the assumption is that we do
> > > > > > > > provide an interface for userspace to enumerate all
> > > > > > > > supported capabilities.
> > > > > > > You cannot trust user for argsz. the size to be copied
> > > > > > > from user must be based on knowledge in kernel. That is
> > > > > > > why we have this version to size lookup.
> > > > > > >
> > > > > > > In VFIO, the size to copy is based on knowledge of each
> > > > > > > VFIO UAPI structures and VFIO flags. But here the flags
> > > > > > > are IOMMU UAPI flags. As you pointed out in another
> > > > > > > thread, VFIO is one user.
> > > > > >
> > > > > > If that is the case, can we let VFIO only copy its own UAPI
> > > > > > fields while simply passing the user pointer of IOMMU UAPI
> > > > > > structure to IOMMU driver for further size check and copy?
> > > > > > Otherwise we are entering a dead end that VFIO doesn't want
> > > > > > to parse a structure which is not defined by him while using
> > > > > > version to represent the black box size is considered as a
> > > > > > discarded scheme and doesn't scale well...
> > > > > I think this could be an other viable option. Let me try to
> > > > > summarize since this has been a long discussion since the
> > > > > original version.
> > > > >
> > > > > Problem statements:
> > > > > 1. When launching vIOMMU in the guest, how can we ensure the
> > > > > host has compatible support upfront? as compared to fail
> > > > > later.
> > >
> > > This sounds like a feature/extension interface, both KVM and vfio
> > > have them to allow userspace to check support of specific
> > > features.
> > Yes, the specific features are the APIs:
> > - bind guest PASID
> > - cache invalidation
> > - page requests & response
> >
> > > > > 2. As UAPI data gets extended (both in size and flags), how
> > > > > can we know the size to copy
> > >
> > > For vfio we of course use the argsz/flags trick where the user
> > > tells us how big the buffer is and flags in the header tell us
> > > what fields beyond the base specification are enabled. This can
> > > get tricky to extend and there can be confusion whether a flag
> > > indicates the presence of a field or the validity of a field.
> > >
> > > We also have interfaces where the ioctl is a header plus a data
> > > blob where flags tell us what the data is. These can serve
> > > double duty as a extension check too as we've done for
> > > VFIO_DEVICE_FEATURE. This doesn't really support extension of a
> > > defined feature though, rather we'd be more likely to create a
> > > set of flags that indicate the data object is feature-v2 and
> > > redefine the structure, or of course we revisit the entire
> > > featuring question within the structure of that data blob.
> > >
> > > We also implement capability chains, though they're more meant for
> > > passing data to the user, where the user provides a buffer and we
> > > link capabilities together within that buffer for the user to
> > > walk. We've defined a mechanism through -ENOSPC and argsz to
> > > tell the user how large a buffer is necessary. I dare mention we
> > > have a version per capability as these are largely modeled after
> > > capability chains in PCI config space. We haven't actually
> > > incremented any versions, but I imagine we'd do so like PCI,
> > > maintaining backwards compatibility and only defining unused bits
> > > and adding fields as the version increases.
> > I guess capability chain is more suitable since the IOCTL uses
> > container FD instead of device FD in VFIO_DEVICE_FEATURE?
> >
> > We can give that a try by treating IOMMU UAPIs as capabilities. That
> > would address problem #1. We really need to check compatibility
> > upfront in that there is no way to fail some of the UAPIs. e.g.
> > unbind guest PASID, cache invalidation.
> >
> > > Is the objection to a global version or to any version fields? I
> > > don't really understand the global version, I'd think a mechanism
> > > to check extensions plus a per structure flags/version would be
> > > preferred. The former should resolve how userspace can test
> > > support for features requiring multiple interfaces.
> > Currently we already have individual version & flags per UAPI data.
> > The reason why I introduced a global/unifier is to simplify the
> > compatibility checking. Global version is optional.
> > With individual version & flags, user may have to keep track of
> > combinations of per structure versions.
>
> A user would only need to meet the minimum uapi matching the feature
> they want to support, right?
>
Yes, then no need for checking all combinations. I guess validation
cases have to cover all legal combinations, but not an issue.

> > > A global version also
> > > implies that we're only ever adding features and never removing.
> > > For example, feature Foo is added in version 4, but it's replaced
> > > by feature Bar in version 5, now userspace can't simply test
> > > version >= 4 must include feature Foo.
> > >
> > Yes, this is why I was hoping to stick with the rule: open for
> > extension, closed to modification.
> > It also makes the code backward compatible easy since there old code
> > would have no change when adding new features.
>
> But we need a way to deprecate things. The version interface is not
> good for that.
>
Agreed, flags are good for deprecation.

> > > It seems to me that version and flags can also be complimentary,
> > > for example a field might be defined by a version but a flag could
> > > indicate if it's implemented. With only the flag, we'd infer the
> > > field from the flag, with only the version we'd need to assume the
> > > field is always implemented. So I have a hard time making a
> > > blanket statement that all versions fields should be avoided.
> > >
> > > > > 3. Maintain backward compatibility while allowing extensions?
> > > > >
> > > > > I think we all agreed that using flags (capability or types)
> > > > > is the way to address #3. As Christoph pointed out, version
> > > > > number should not be used for this purpose.
> > > > >
> > > > > So for problem 1 & 2, we have the following options:
> > > > > 1. Have a version-size mapping as proposed in this set. VFIO
> > > > > copies from user the correct size based on version-type
> > > > > lookup. Processing of the data is based on flags in IOMMU
> > > > > driver.
> > > > >
> > > > > 2. VFIO copy its own minsz then pass the user pointer to IOMMU
> > > > > driver for further copy_from_user based on flags. (by Kevin)
> > > > >
> > > > > 3. Adopt VFIO argsz scheme, caller fills in argsz for the
> > > > > offset the variable size union. VFIO do not check argsz in
> > > > > that it requires IOMMU specific knowledge. IOMMU driver Use
> > > > > flags to handle the variable size.(by Alex). I think this
> > > > > what we have in Yi's VFIO & QEMU patch. argsz filled by QEMU
> > > > > includes bind_data.
> > > > >
> > > > > 4. Do not use a unified version, have a fixed size of all UAPI
> > > > > structures, padding in struct and union. (Wasteful, not
> > > > > preferred per V1 discussion)
> > > > >
> > > > > For both 2 & 3, a unified version is not used, each API
> > > > > treated separately. vIOMMU will be launched w/o assurance of
> > > > > compatibility of all APIs. Fault handling may be more complex
> > > > > in normal operations.
> > > > >
> > > > > Appreciate everyone's input. Joerg and Alex, could you help to
> > > > > make a decision here?
> > >
> > > As above, I think using a global API version number to imply
> > > support for a feature is doomed to fail, we should instead expose
> > > an interface to check for specific features.
> > I agree. I feel we can use the capability chain at container level
> > as you mentioned, right?
>
> We could either do a capability chain on the IOMMU_GET_INFO ioctl or
> we have the CHECK_EXTENSION ioctl. The latter is quite a bit easier
> to implement, but limited in what it can expose and the semantics
> around checking extensions before vs after an IOMMU model is set can
> be confusing. I didn't have a strong preference between them when
> suggesting Kirti use one of these to indicate dirty page tracking, but
> it would be nice if both features used the same mechanism
>
Sounds good. Just to clarify, similar to Yi's patch that uses
CHECK_EXTENSION
https://lkml.org/lkml/2020/3/22/109

But instead of checking for the global UAPI version, we check for each
feature. i.e.
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index b959d0a37c4f..639e3c970b31 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -46,6 +46,9 @@
* code will taint the host kernel and should be used with extrem
*/
#define VFIO_NOIOMMU_IOMMU 8
+#define VFIO_IOMMU_BIND_GPASID 9
+#define VFIO_IOMMU_INVALIDATE 10
+#define VFIO_IOMMU_PAGE_REQUEST 11

Applications such as QEMU would include uapi/iommu.h, which has the
IOMMU UAPI extensions defined.
e.g.
#define IOMMU_GPASID_BIND_VERSION_1

When VM starts, it will check all uAPI extensions returned by the
kernel against what is in its header, e.g.

extension = ioctl(s->container, VFIO_CHECK_EXTENSION,
VFIO_IOMMU_BIND_GPASID);
if (extension >= IOMMU_GPASID_BIND_VERSION_1) {
/* Kernel has newer extensions, we are good */
} else {
stop_viommu();
}

> > > In any of the proposed solutions, the
> > > IOMMU driver is ultimately responsible for validating the user
> > > data, so do we want vfio performing the copy_from_user() to an
> > > object that could later be assumed to be sanitized, or should
> > > vfio just pass a user pointer to make it obvious that the
> > > consumer is responsible for all the user protections? Seems like
> > > the latter.
> > I like the latter as well.
> >
> > > That still really
> > > doesn't address what's in that user data blob yet, but the vfio
> > > interface could be:
> > >
> > > struct {
> > > __u32 argsz;
> > > __u32 flags;
> > > __u8 data[];
> > > }
> > >
> > > Where flags might be partitioned like we do for DEVICE_FEATURE to
> > > indicate the format of data and what vfio should do with it, and
> > > data might simply be defined as a (__u64 __user *).
> > >
> > So, __user * will be passed to IOMMU driver if VFIO checks minsz
> > include flags and they are valid.
> > IOMMU driver can copy the rest based on the mandatory version/minsz
> > and flags in the IOMMU uAPI structs.
> > Does it sound right? This is really choice #2.
>
> Sounds like each IOMMU UAPI struct just needs to have an embedded size
> and flags field, but yes.
>
Yes, an argsz field can be added to each UAPI. There are already flags
or the equivalent. IOMMU driver can process the __user * based on the
argsz, flags, check argsz against offsetofend(iommu_uapi_struct,
last_element), etc.;

> > > This user pointer would then likely be an IOMMU UAPI struct, so
> > > I've only just gotten back the the IOMMU UAPI question at hand,
> > > but I don't really see the disadvantage to including both version
> > > and flags fields per structure. Perhaps this is choice 1. above,
> > > but with a version at a per structure level indicating the
> > > backwards compatible size and layout of the structure and flags
> > > being used to indicate support for optional features within those
> > > fields.
> >
> > Per structure version & flags is what we have in the mainline. It
> > applies to both choice 1 & 2. The global/unified version is just a
> > re-interpretation of the versions such that we have a monolithic
> > incrementing version. Again, global version is optional.
> >
> > > Is a version field still taboo
> > > for such a use case? Thanks,
> >
> > I will leave that to Christoph :)
>
> Looks like if we drop the global version in favor of an
> extension/feature check and embed size and flags fields in each IOMMU
> UAPI struct, Christoph will be satisfied. Thanks,
>
Agreed, we will work towards it. Thanks a lot!

> Alex
>