Re: [PATCH v13 15/15] vfio/type1: Return the MSI geometry through VFIO_IOMMU_GET_INFO capability chains

From: Alex Williamson
Date: Fri Oct 07 2016 - 16:38:26 EST


On Fri, 7 Oct 2016 19:10:27 +0200
Auger Eric <eric.auger@xxxxxxxxxx> wrote:

> Hi Alex,
>
> On 06/10/2016 22:42, Alex Williamson wrote:
> > On Thu, 6 Oct 2016 14:20:40 -0600
> > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
> >
> >> On Thu, 6 Oct 2016 08:45:31 +0000
> >> Eric Auger <eric.auger@xxxxxxxxxx> wrote:
> >>
> >>> This patch allows the user-space to retrieve the MSI geometry. The
> >>> implementation is based on capability chains, now also added to
> >>> VFIO_IOMMU_GET_INFO.
> >>>
> >>> The returned info comprise:
> >>> - whether the MSI IOVA are constrained to a reserved range (x86 case) and
> >>> in the positive, the start/end of the aperture,
> >>> - or whether the IOVA aperture need to be set by the userspace. In that
> >>> case, the size and alignment of the IOVA window to be provided are
> >>> returned.
> >>>
> >>> In case the userspace must provide the IOVA aperture, we currently report
> >>> a size/alignment based on all the doorbells registered by the host kernel.
> >>> This may exceed the actual needs.
> >>>
> >>> Signed-off-by: Eric Auger <eric.auger@xxxxxxxxxx>
> >>>
> >>> ---
> >>> v11 -> v11:
> >>> - msi_doorbell_pages was renamed msi_doorbell_calc_pages
> >>>
> >>> v9 -> v10:
> >>> - move cap_offset after iova_pgsizes
> >>> - replace __u64 alignment by __u32 order
> >>> - introduce __u32 flags in vfio_iommu_type1_info_cap_msi_geometry and
> >>> fix alignment
> >>> - call msi-doorbell API to compute the size/alignment
> >>>
> >>> v8 -> v9:
> >>> - use iommu_msi_supported flag instead of programmable
> >>> - replace IOMMU_INFO_REQUIRE_MSI_MAP flag by a more sophisticated
> >>> capability chain, reporting the MSI geometry
> >>>
> >>> v7 -> v8:
> >>> - use iommu_domain_msi_geometry
> >>>
> >>> v6 -> v7:
> >>> - remove the computation of the number of IOVA pages to be provisionned.
> >>> This number depends on the domain/group/device topology which can
> >>> dynamically change. Let's rely instead rely on an arbitrary max depending
> >>> on the system
> >>>
> >>> v4 -> v5:
> >>> - move msi_info and ret declaration within the conditional code
> >>>
> >>> v3 -> v4:
> >>> - replace former vfio_domains_require_msi_mapping by
> >>> more complex computation of MSI mapping requirements, especially the
> >>> number of pages to be provided by the user-space.
> >>> - reword patch title
> >>>
> >>> RFC v1 -> v1:
> >>> - derived from
> >>> [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
> >>> - renamed allow_msi_reconfig into require_msi_mapping
> >>> - fixed VFIO_IOMMU_GET_INFO
> >>> ---
> >>> drivers/vfio/vfio_iommu_type1.c | 78 ++++++++++++++++++++++++++++++++++++++++-
> >>> include/uapi/linux/vfio.h | 32 ++++++++++++++++-
> >>> 2 files changed, 108 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index dc3ee5d..ce5e7eb 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -38,6 +38,8 @@
> >>> #include <linux/workqueue.h>
> >>> #include <linux/dma-iommu.h>
> >>> #include <linux/msi-doorbell.h>
> >>> +#include <linux/irqdomain.h>
> >>> +#include <linux/msi.h>
> >>>
> >>> #define DRIVER_VERSION "0.2"
> >>> #define DRIVER_AUTHOR "Alex Williamson <alex.williamson@xxxxxxxxxx>"
> >>> @@ -1101,6 +1103,55 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
> >>> return ret;
> >>> }
> >>>
> >>> +static int compute_msi_geometry_caps(struct vfio_iommu *iommu,
> >>> + struct vfio_info_cap *caps)
> >>> +{
> >>> + struct vfio_iommu_type1_info_cap_msi_geometry *vfio_msi_geometry;
> >>> + unsigned long order = __ffs(vfio_pgsize_bitmap(iommu));
> >>> + struct iommu_domain_msi_geometry msi_geometry;
> >>> + struct vfio_info_cap_header *header;
> >>> + struct vfio_domain *d;
> >>> + bool reserved;
> >>> + size_t size;
> >>> +
> >>> + mutex_lock(&iommu->lock);
> >>> + /* All domains have same require_msi_map property, pick first */
> >>> + d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
> >>> + iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_GEOMETRY,
> >>> + &msi_geometry);
> >>> + reserved = !msi_geometry.iommu_msi_supported;
> >>> +
> >>> + mutex_unlock(&iommu->lock);
> >>> +
> >>> + size = sizeof(*vfio_msi_geometry);
> >>> + header = vfio_info_cap_add(caps, size,
> >>> + VFIO_IOMMU_TYPE1_INFO_CAP_MSI_GEOMETRY, 1);
> >>> +
> >>> + if (IS_ERR(header))
> >>> + return PTR_ERR(header);
> >>> +
> >>> + vfio_msi_geometry = container_of(header,
> >>> + struct vfio_iommu_type1_info_cap_msi_geometry,
> >>> + header);
> >>> +
> >>> + vfio_msi_geometry->flags = reserved;
> >>
> >> Use the bit flag VFIO_IOMMU_MSI_GEOMETRY_RESERVED
> >>
> >>> + if (reserved) {
> >>> + vfio_msi_geometry->aperture_start = msi_geometry.aperture_start;
> >>> + vfio_msi_geometry->aperture_end = msi_geometry.aperture_end;
> >>
> >> But maybe nobody has set these, did you intend to use
> >> iommu_domain_msi_aperture_valid(), which you defined early on but never
> >> used?
> >>
> >>> + return 0;
> >>> + }
> >>> +
> >>> + vfio_msi_geometry->order = order;
> >>
> >> I'm tempted to suggest that a user could do the same math on their own
> >> since we provide the supported bitmap already... could it ever not be
> >> the same?
> >>
> >>> + /*
> >>> + * we compute a system-wide requirement based on all the registered
> >>> + * doorbells
> >>> + */
> >>> + vfio_msi_geometry->size =
> >>> + msi_doorbell_calc_pages(order) * ((uint64_t) 1 << order);
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>> unsigned int cmd, unsigned long arg)
> >>> {
> >>> @@ -1122,8 +1173,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>> }
> >>> } else if (cmd == VFIO_IOMMU_GET_INFO) {
> >>> struct vfio_iommu_type1_info info;
> >>> + struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
> >>> + int ret;
> >>>
> >>> - minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
> >>> + minsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
> >>>
> >>> if (copy_from_user(&info, (void __user *)arg, minsz))
> >>> return -EFAULT;
> >>> @@ -1135,6 +1188,29 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >>>
> >>> info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
> >>>
> >>> + ret = compute_msi_geometry_caps(iommu, &caps);
> >>> + if (ret)
> >>> + return ret;
> >>> +
> >>> + if (caps.size) {
> >>> + info.flags |= VFIO_IOMMU_INFO_CAPS;
> >>> + if (info.argsz < sizeof(info) + caps.size) {
> >>> + info.argsz = sizeof(info) + caps.size;
> >>> + info.cap_offset = 0;
> >>> + } else {
> >>> + vfio_info_cap_shift(&caps, sizeof(info));
> >>> + if (copy_to_user((void __user *)arg +
> >>> + sizeof(info), caps.buf,
> >>> + caps.size)) {
> >>> + kfree(caps.buf);
> >>> + return -EFAULT;
> >>> + }
> >>> + info.cap_offset = sizeof(info);
> >>> + }
> >>> +
> >>> + kfree(caps.buf);
> >>> + }
> >>> +
> >>> return copy_to_user((void __user *)arg, &info, minsz) ?
> >>> -EFAULT : 0;
> >>>
> >>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>> index 4a9dbc2..8dae013 100644
> >>> --- a/include/uapi/linux/vfio.h
> >>> +++ b/include/uapi/linux/vfio.h
> >>> @@ -488,7 +488,35 @@ struct vfio_iommu_type1_info {
> >>> __u32 argsz;
> >>> __u32 flags;
> >>> #define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
> >>> - __u64 iova_pgsizes; /* Bitmap of supported page sizes */
> >>> +#define VFIO_IOMMU_INFO_CAPS (1 << 1) /* Info supports caps */
> >>> + __u64 iova_pgsizes; /* Bitmap of supported page sizes */
> >>> + __u32 __resv;
> >>> + __u32 cap_offset; /* Offset within info struct of first cap */
> >>> +};
> >>
> >> I understand the padding, but not the ordering. Why not end with
> >> padding?
> >>
> >>> +
> >>> +#define VFIO_IOMMU_TYPE1_INFO_CAP_MSI_GEOMETRY 1
> >>> +
> >>> +/*
> >>> + * The MSI geometry capability allows to report the MSI IOVA geometry:
> >>> + * - either the MSI IOVAs are constrained within a reserved IOVA aperture
> >>> + * whose boundaries are given by [@aperture_start, @aperture_end].
> >>> + * this is typically the case on x86 host. The userspace is not allowed
> >>> + * to map userspace memory at IOVAs intersecting this range using
> >>> + * VFIO_IOMMU_MAP_DMA.
> >>> + * - or the MSI IOVAs are not requested to belong to any reserved range;
> >>> + * in that case the userspace must provide an IOVA window characterized by
> >>> + * @size and @alignment using VFIO_IOMMU_MAP_DMA with RESERVED_MSI_IOVA flag.
> >>> + */
> >>> +struct vfio_iommu_type1_info_cap_msi_geometry {
> >>> + struct vfio_info_cap_header header;
> >>> + __u32 flags;
> >>> +#define VFIO_IOMMU_MSI_GEOMETRY_RESERVED (1 << 0) /* reserved geometry */
> >>> + /* not reserved */
> >>> + __u32 order; /* iommu page order used for aperture alignment*/
> >>> + __u64 size; /* IOVA aperture size (bytes) the userspace must provide */
> >>> + /* reserved */
> >>> + __u64 aperture_start;
> >>> + __u64 aperture_end;
> >>
> >> Should these be a union? We never set them both. Should the !reserved
> >> case have a flag as well, so the user can positively identify what's
> >> being provided?
> >
> > Actually, is there really any need to fit both of these within the same
> > structure? Part of the idea of the capability chains is we can create
> > a capability for each new thing we want to describe. So, we could
> > simply define a generic reserved IOVA range capability with a 'start'
> > and 'end' and then another capability to define MSI mapping
> > requirements. Thanks,
> Yes your suggested approach makes sense to me.
>
> One reason why I proceeded that way is we are mixing things at iommu.h
> level too. Personally I would have preferred to separate things:
> 1) add a new IOMMU_CAP_TRANSLATE_MSI capability in iommu_cap
> 2) rename iommu_msi_supported into "programmable" bool: reporting
> whether the aperture is reserved or programmable.
>
> In the early releases I think it was as above but slightly we moved to a
> mixed description.
>
> What do you think?

The API certainly doesn't seem like it has a cohesive feel to me. It's
not entirely clear to me how we know when we need to register a DMA MSI
cookie, or how we know that the MSI doorbell API is actually
initialized and in use by the MSI/IOMMU layer, or exactly what is the
MSI geometry telling me. Perhaps this is why the code doesn't seem to
have a good rejection mechanism for architectures that need it versus
those that don't, it's too hard to tell.

Maybe we can look at what we think the user API should be and work
backwards. For x86 we simply have a reserved range of IOVA. I'm not
entirely sure it adds to the user API to know that it's for MSI, it's
just a range of IOVAs that we cannot allocate for regular DMA. In
fact, we currently lack a mechanism for describing the IOVA space of
the IOMMU at all, so rather than focusing on a mechanism to describe a
hole in the IOVA space, we might simply want to focus on a mechanism to
describe the available IOVA space. Everybody needs that, not just
x86. That sort of sounds like a VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
that perhaps looks like:

struct vfio_iommu_type1_info_cap_iova_range {
struct vfio_info_cap_header header;
u64 start;
u64 end;
};

Clearly we need to allow multiple of these in the capability chain
since the existing x86 MSI range bisects this address space.

To support this, we basically need the same information from the IOMMU
API. We already have DOMAIN_ATTR_GEOMETRY, which should give us the
base IOVA range, but we don't have anything describing the gaps. We
don't know how many sources of gaps we'll have in the future, but let's
keep it simple and assume we can look for MSI gaps and add other
possible sources of gaps in the future, it's an internal API after all.
So we can use DOMAIN_ATTR_MSI_GEOMETRY to tell us about the (we assume
one) MSI range of reserved IOVA within DOMAIN_ATTR_GEOMETRY. For x86
this is fixed, for SMMU this is a zero range until someone programs it.

Now, what does a user need to know to add a reserved MSI IOVA range?
They need to know a) that it needs to be done, and b) how big to make
it (and maybe alignment requirements). Really all we need to describe
then is b) since b) implies a). So maybe that gives us another
capability chain entry:

struct vfio_iommu_type1_info_cap_msi_resv {
struct vfio_info_cap_header header;
u64 size;
u64 alignment;
};

It doesn't seem like we need to waste a flag bit on
vfio_iommu_type1_info.flags for this since the existence of this
capability would imply that VFIO_IOMMU_MAP_DMA supports an MSI_RESV
flag.

So what do we need from the kernel infrastructure to make that happen?
Well, we need a) and b) above, and again b) can imply a), so if the
IOMMU API provided a DOMAIN_ATTR_MSI_RESV, providing the same
size/alignment, then we're nearly there. Then we just need a way to
set that range, which I'd probably try to plumb through the IOMMU API
rather than pulling in separate doorbell APIs and DMA cookie APIs. If
it's going to pull together all those different things, let's at least
only do that in one place so we can expose a consistent API through the
IOMMU API. Obviously once a range is set, DOMAIN_ATTR_MSI_RESV should
report that range, so if the user were to look at the type1 info
capability chain again, the available IOVA ranges would reflect the now
reserved range.

Maybe that's more than you're asking for, but that's the approach I
would take to solidify the API. Thanks,

Alex