Re: [RFC 14/20] iommu/iommufd: Add iommufd_device_[de]attach_ioasid()

From: Jason Gunthorpe
Date: Tue Sep 21 2021 - 14:02:10 EST


On Sun, Sep 19, 2021 at 02:38:42PM +0800, Liu Yi L wrote:
> An I/O address space takes effect in the iommu only after it's attached
> by a device. This patch provides iommufd_device_[de/at]tach_ioasid()
> helpers for this purpose. One device can be only attached to one ioasid
> at this point, but one ioasid can be attached by multiple devices.
>
> The caller specifies the iommufd_device (returned at binding time) and
> the target ioasid when calling the helper function. Upon request, iommufd
> installs the specified I/O page table to the correct place in the IOMMU,
> according to the routing information (struct device* which represents
> RID) recorded in iommufd_device. Future variants could allow the caller
> to specify additional routing information (e.g. pasid/ssid) when multiple
> I/O address spaces are supported per device.
>
> Open:
> Per Jason's comment in below link, bus-specific wrappers are recommended.
> This RFC implements one wrapper for pci device. But it looks that struct
> pci_device is not used at all since iommufd_ device already carries all
> necessary info. So want to have another discussion on its necessity, e.g.
> whether making more sense to have bus-specific wrappers for binding, while
> leaving a common attaching helper per iommufd_device.
> https://lore.kernel.org/linux-iommu/20210528233649.GB3816344@xxxxxxxxxx/
>
> TODO:
> When multiple devices are attached to a same ioasid, the permitted iova
> ranges and supported pgsize bitmap on this ioasid should be a common
> subset of all attached devices. iommufd needs to track such info per
> ioasid and update it every time when a new device is attached to the
> ioasid. This has not been done in this version yet, due to the temporary
> hack adopted in patch 16-18. The hack reuses vfio type1 driver which
> already includes the necessary logic for iova ranges and pgsize bitmap.
> Once we get a clear direction for those patches, that logic will be moved
> to this patch.
>
> Signed-off-by: Liu Yi L <yi.l.liu@xxxxxxxxx>
> drivers/iommu/iommufd/iommufd.c | 226 ++++++++++++++++++++++++++++++++
> include/linux/iommufd.h | 29 ++++
> 2 files changed, 255 insertions(+)
>
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e45d76359e34..25373a0e037a 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -51,6 +51,19 @@ struct iommufd_ioas {
> bool enforce_snoop;
> struct iommufd_ctx *ictx;
> refcount_t refs;
> + struct mutex lock;
> + struct list_head device_list;
> + struct iommu_domain *domain;

This should just be another xarray indexed by the device id

> +/* Caller should hold ioas->lock */
> +static struct ioas_device_info *ioas_find_device(struct iommufd_ioas *ioas,
> + struct iommufd_device *idev)
> +{
> + struct ioas_device_info *ioas_dev;
> +
> + list_for_each_entry(ioas_dev, &ioas->device_list, next) {
> + if (ioas_dev->idev == idev)
> + return ioas_dev;
> + }

Which eliminates this search. xarray with tightly packed indexes is
generally more efficient than linked lists..

> +static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
> + struct device *dev)
> +{
> + bool snoop = false;
> + u32 addr_width;
> + int ret;
> +
> + /*
> + * currently we only support I/O page table with iommu enforce-snoop
> + * format. Attaching a device which doesn't support this format in its
> + * upstreaming iommu is rejected.
> + */
> + ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
> + if (ret || !snoop)
> + return -EINVAL;
> +
> + ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width);
> + if (ret || addr_width < ioas->addr_width)
> + return -EINVAL;
> +
> + /* TODO: also need to check permitted iova ranges and pgsize bitmap */
> +
> + return 0;
> +}

This seems kind of weird..

I expect the iommufd to hold a SW copy of the IO page table and each
time a new domain is to be created it should push the SW copy into the
domain. If the domain cannot support it then the domain driver should
naturally fail a request.

When the user changes the IO page table the SW copy is updated then
all of the domains are updated too - again if any domain cannot
support the change then it fails and the change is rolled back.

It seems like this is a side effect of roughly hacking in the vfio
code?

> +
> +/**
> + * iommufd_device_attach_ioasid - attach device to an ioasid
> + * @idev: [in] Pointer to struct iommufd_device.
> + * @ioasid: [in] ioasid points to an I/O address space.
> + *
> + * Returns 0 for successful attach, otherwise returns error.
> + *
> + */
> +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid)

Types for the ioas_id again..

> +{
> + struct iommufd_ioas *ioas;
> + struct ioas_device_info *ioas_dev;
> + struct iommu_domain *domain;
> + int ret;
> +
> + ioas = ioasid_get_ioas(idev->ictx, ioasid);
> + if (!ioas) {
> + pr_err_ratelimited("Trying to attach illegal or unkonwn IOASID %u\n", ioasid);
> + return -EINVAL;

No prints triggered by bad userspace

> + }
> +
> + mutex_lock(&ioas->lock);
> +
> + /* Check for duplicates */
> + if (ioas_find_device(ioas, idev)) {
> + ret = -EINVAL;
> + goto out_unlock;
> + }

just xa_cmpxchg NULL, XA_ZERO_ENTRY

> + /*
> + * Each ioas is backed by an iommu domain, which is allocated
> + * when the ioas is attached for the first time and then shared
> + * by following devices.
> + */
> + if (list_empty(&ioas->device_list)) {

Seems strange, what if the devices are forced to have different
domains? We don't want to model that in the SW layer..

> + /* Install the I/O page table to the iommu for this device */
> + ret = iommu_attach_device(domain, idev->dev);
> + if (ret)
> + goto out_domain;

This is where things start to get confusing when you talk about PASID
as the above call needs to be some PASID centric API.

> @@ -27,6 +28,16 @@ struct iommufd_device *
> iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
> void iommufd_unbind_device(struct iommufd_device *idev);
>
> +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid);
> +void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid);
> +
> +static inline int
> +__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
> + struct iommufd_device *idev, int ioasid)
> +{
> + return iommufd_device_attach_ioasid(idev, ioasid);
> +}

If think sis taking in the iommfd_device then there isn't a logical
place to signal the PCIness

But, I think the API should at least have a kdoc that this is
capturing the entire device and specify that for PCI this means all
TLPs with the RID.

Jason