Re: [PATCH V10 11/11] iommu/vt-d: Add custom allocator for IOASID

From: Jacob Pan
Date: Wed Apr 01 2020 - 11:42:13 EST


On Sat, 28 Mar 2020 10:22:41 +0000
"Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:

> > From: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
> > Sent: Saturday, March 21, 2020 7:28 AM
> >
> > When VT-d driver runs in the guest, PASID allocation must be
> > performed via virtual command interface. This patch registers a
> > custom IOASID allocator which takes precedence over the default
> > XArray based allocator. The resulting IOASID allocation will always
> > come from the host. This ensures that PASID namespace is system-
> > wide.
> >
> > Signed-off-by: Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx>
> > Signed-off-by: Liu, Yi L <yi.l.liu@xxxxxxxxx>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
> > ---
> > drivers/iommu/intel-iommu.c | 84
> > +++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/intel-iommu.h | 2 ++
> > 2 files changed, 86 insertions(+)
> >
> > diff --git a/drivers/iommu/intel-iommu.c
> > b/drivers/iommu/intel-iommu.c index a76afb0fd51a..c1c0b0fb93c3
> > 100644 --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -1757,6 +1757,9 @@ static void free_dmar_iommu(struct intel_iommu
> > *iommu)
> > if (ecap_prs(iommu->ecap))
> > intel_svm_finish_prq(iommu);
> > }
> > + if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap))
> > +
> > ioasid_unregister_allocator(&iommu->pasid_allocator); +
> > #endif
> > }
> >
> > @@ -3291,6 +3294,84 @@ static int copy_translation_tables(struct
> > intel_iommu *iommu)
> > return ret;
> > }
> >
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +static ioasid_t intel_ioasid_alloc(ioasid_t min, ioasid_t max,
> > void *data)
>
> the name is too generic... can we add vcmd in the name to clarify
> its purpose, e.g. intel_vcmd_ioasid_alloc?
>
I feel the intel_ prefix is a natural extension of a generic API, we do
that for other IOMMU APIs, right?

> > +{
> > + struct intel_iommu *iommu = data;
> > + ioasid_t ioasid;
> > +
> > + if (!iommu)
> > + return INVALID_IOASID;
> > + /*
> > + * VT-d virtual command interface always uses the full 20
> > bit
> > + * PASID range. Host can partition guest PASID range based
> > on
> > + * policies but it is out of guest's control.
> > + */
> > + if (min < PASID_MIN || max > intel_pasid_max_id)
> > + return INVALID_IOASID;
> > +
> > + if (vcmd_alloc_pasid(iommu, &ioasid))
> > + return INVALID_IOASID;
> > +
> > + return ioasid;
> > +}
> > +
> > +static void intel_ioasid_free(ioasid_t ioasid, void *data)
> > +{
> > + struct intel_iommu *iommu = data;
> > +
> > + if (!iommu)
> > + return;
> > + /*
> > + * Sanity check the ioasid owner is done at upper layer,
> > e.g. VFIO
> > + * We can only free the PASID when all the devices are
> > unbound.
> > + */
> > + if (ioasid_find(NULL, ioasid, NULL)) {
> > + pr_alert("Cannot free active IOASID %d\n", ioasid);
> > + return;
> > + }
>
> However the sanity check is not done in default_free. Is there a
> reason why using vcmd adds such new requirement?
>
Since we don't support nested guest. This vcmd allocator is only used
by the guest IOMMU driver not VFIO. We expect IOMMU driver to have
control of the free()/unbind() ordering.

For default_free, it can come from user space and host VFIO which can
be out of order. But we will solve that issue with the blocking
notifier.

> > + vcmd_free_pasid(iommu, ioasid);
> > +}
> > +
> > +static void register_pasid_allocator(struct intel_iommu *iommu)
> > +{
> > + /*
> > + * If we are running in the host, no need for custom
> > allocator
> > + * in that PASIDs are allocated from the host system-wide.
> > + */
> > + if (!cap_caching_mode(iommu->cap))
> > + return;
>
> is it more accurate to check against vcmd capability?
>
I think this is sufficient. The spec says if vcmd is present, we must
use it but not the other way.

> > +
> > + if (!sm_supported(iommu)) {
> > + pr_warn("VT-d Scalable Mode not enabled, no PASID
> > allocation\n");
> > + return;
> > + }
> > +
> > + /*
> > + * Register a custom PASID allocator if we are running in
> > a guest,
> > + * guest PASID must be obtained via virtual command
> > interface.
> > + * There can be multiple vIOMMUs in each guest but only one
> > allocator
> > + * is active. All vIOMMU allocators will eventually be
> > calling the same
>
> which one? the first or last?
>
All allocators share the same ops, so first=last. IOASID code will
inspect the ops function and see if they are shared with others then
use the same ops.

> > + * host allocator.
> > + */
> > + if (ecap_vcs(iommu->ecap) && vccap_pasid(iommu->vccap)) {
> > + pr_info("Register custom PASID allocator\n");
> > + iommu->pasid_allocator.alloc = intel_ioasid_alloc;
> > + iommu->pasid_allocator.free = intel_ioasid_free;
> > + iommu->pasid_allocator.pdata = (void *)iommu;
> > + if
> > (ioasid_register_allocator(&iommu->pasid_allocator)) {
> > + pr_warn("Custom PASID allocator failed,
> > scalable mode disabled\n");
> > + /*
> > + * Disable scalable mode on this IOMMU if
> > there
> > + * is no custom allocator. Mixing SM
> > capable vIOMMU
> > + * and non-SM vIOMMU are not supported.
> > + */
> > + intel_iommu_sm = 0;
>
> since you register an allocator for every vIOMMU, means previously
> registered allocators should also be unregistered here?
>
True, but it is not necessary for two reasons:
1. This should not happen unless something went seriously wrong.
All vIOMMU shares the same alloc/free function, so they are put under
the same bucket by IOASID. So the case for the first vIOMMU to succeed
then fail in later vIOMMU registration should not happen. Unless kernel
run out of memory etc.

2. Once SM is disabled, there is no user of ioasid allocator.

> > + }
> > + }
> > +}
> > +#endif
> > +
> > static int __init init_dmars(void)
> > {
> > struct dmar_drhd_unit *drhd;
> > @@ -3408,6 +3489,9 @@ static int __init init_dmars(void)
> > */
> > for_each_active_iommu(iommu, drhd) {
> > iommu_flush_write_buffer(iommu);
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > + register_pasid_allocator(iommu);
> > +#endif
> > iommu_set_root_entry(iommu);
> > iommu->flush.flush_context(iommu, 0, 0, 0,
> > DMA_CCMD_GLOBAL_INVL);
> > iommu->flush.flush_iotlb(iommu, 0, 0, 0,
> > DMA_TLB_GLOBAL_FLUSH);
> > diff --git a/include/linux/intel-iommu.h
> > b/include/linux/intel-iommu.h index 9cbf5357138b..9c357a325c72
> > 100644 --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -19,6 +19,7 @@
> > #include <linux/iommu.h>
> > #include <linux/io-64-nonatomic-lo-hi.h>
> > #include <linux/dmar.h>
> > +#include <linux/ioasid.h>
> >
> > #include <asm/cacheflush.h>
> > #include <asm/iommu.h>
> > @@ -563,6 +564,7 @@ struct intel_iommu {
> > #ifdef CONFIG_INTEL_IOMMU_SVM
> > struct page_req_dsc *prq;
> > unsigned char prq_name[16]; /* Name for PRQ interrupt */
> > + struct ioasid_allocator_ops pasid_allocator; /* Custom
> > allocator for PASIDs */
> > #endif
> > struct q_inval *qi; /* Queued invalidation
> > info */ u32 *iommu_state; /* Store iommu states between suspend and
> > resume.*/
> > --
> > 2.7.4
>

[Jacob Pan]