Re: [RFC 01/20] iommu/iommufd: Add /dev/iommu core

From: Jason Gunthorpe
Date: Tue Sep 21 2021 - 11:41:54 EST


On Sun, Sep 19, 2021 at 02:38:29PM +0800, Liu Yi L wrote:
> /dev/iommu aims to provide a unified interface for managing I/O address
> spaces for devices assigned to userspace. This patch adds the initial
> framework to create a /dev/iommu node. Each open of this node returns an
> iommufd. And this fd is the handle for userspace to initiate its I/O
> address space management.
>
> One open:
> - We call this feature as IOMMUFD in Kconfig in this RFC. However this
> name is not clear enough to indicate its purpose to user. Back to 2010
> vfio even introduced a /dev/uiommu [1] as the predecessor of its
> container concept. Is that a better name? Appreciate opinions here.
>
> [1] https://lore.kernel.org/kvm/4c0eb470.1HMjondO00NIvFM6%25pugs@xxxxxxxxx/
>
> Signed-off-by: Liu Yi L <yi.l.liu@xxxxxxxxx>
> drivers/iommu/Kconfig | 1 +
> drivers/iommu/Makefile | 1 +
> drivers/iommu/iommufd/Kconfig | 11 ++++
> drivers/iommu/iommufd/Makefile | 2 +
> drivers/iommu/iommufd/iommufd.c | 112 ++++++++++++++++++++++++++++++++
> 5 files changed, 127 insertions(+)
> create mode 100644 drivers/iommu/iommufd/Kconfig
> create mode 100644 drivers/iommu/iommufd/Makefile
> create mode 100644 drivers/iommu/iommufd/iommufd.c
>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 07b7c25cbed8..a83ce0acd09d 100644
> +++ b/drivers/iommu/Kconfig
> @@ -136,6 +136,7 @@ config MSM_IOMMU
>
> source "drivers/iommu/amd/Kconfig"
> source "drivers/iommu/intel/Kconfig"
> +source "drivers/iommu/iommufd/Kconfig"
>
> config IRQ_REMAP
> bool "Support for Interrupt Remapping"
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index c0fb0ba88143..719c799f23ad 100644
> +++ b/drivers/iommu/Makefile
> @@ -29,3 +29,4 @@ obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
> obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o io-pgfault.o
> obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
> +obj-$(CONFIG_IOMMUFD) += iommufd/
> diff --git a/drivers/iommu/iommufd/Kconfig b/drivers/iommu/iommufd/Kconfig
> new file mode 100644
> index 000000000000..9fb7769a815d
> +++ b/drivers/iommu/iommufd/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config IOMMUFD
> + tristate "I/O Address Space management framework for passthrough devices"
> + select IOMMU_API
> + default n
> + help
> + provides unified I/O address space management framework for
> + isolating untrusted DMAs via devices which are passed through
> + to userspace drivers.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
> new file mode 100644
> index 000000000000..54381a01d003
> +++ b/drivers/iommu/iommufd/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_IOMMUFD) += iommufd.o
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> new file mode 100644
> index 000000000000..710b7e62988b
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -0,0 +1,112 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * I/O Address Space Management for passthrough devices
> + *
> + * Copyright (C) 2021 Intel Corporation
> + *
> + * Author: Liu Yi L <yi.l.liu@xxxxxxxxx>
> + */
> +
> +#define pr_fmt(fmt) "iommufd: " fmt
> +
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/miscdevice.h>
> +#include <linux/mutex.h>
> +#include <linux/iommu.h>
> +
> +/* Per iommufd */
> +struct iommufd_ctx {
> + refcount_t refs;
> +};

A private_data of a struct file should avoid having a refcount (and
this should have been a kref anyhow)

Use the refcount on the struct file instead.

In general the lifetime models look overly convoluted to me with
refcounts being used as locks and going in all manner of directions.

- No refcount on iommufd_ctx, this should use the fget on the fd.
The driver facing version of the API has the driver holds a fget
inside the iommufd_device.

- Put a rwlock inside the iommufd_ioas that is a
'destroying_lock'. The rwlock starts out unlocked.

Acquire from the xarray is
rcu_lock()
ioas = xa_load()
if (ioas)
if (down_read_trylock(&ioas->destroying_lock))
// success
Unacquire is just up_read()

Do down_write when the ioas is to be destroyed, do not return ebusy.

- Delete the iommufd_ctx->lock. Use RCU to protect load, erase/alloc does
not need locking (order it properly too, it is in the wrong order), and
don't check for duplicate devices or dev_cookie duplication, that
is user error and is harmless to the kernel.

> +static int iommufd_fops_release(struct inode *inode, struct file *filep)
> +{
> + struct iommufd_ctx *ictx = filep->private_data;
> +
> + filep->private_data = NULL;

unnecessary

> + iommufd_ctx_put(ictx);
> +
> + return 0;
> +}
> +
> +static long iommufd_fops_unl_ioctl(struct file *filep,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct iommufd_ctx *ictx = filep->private_data;
> + long ret = -EINVAL;
> +
> + if (!ictx)
> + return ret;

impossible

> +
> + switch (cmd) {
> + default:
> + pr_err_ratelimited("unsupported cmd %u\n", cmd);

don't log user triggerable events

Jason