Re: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

From: Jason Gunthorpe
Date: Tue Sep 21 2021 - 13:09:49 EST


On Sun, Sep 19, 2021 at 02:38:34PM +0800, Liu Yi L wrote:
> From: Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx>
>
> This extends iommu core to manage security context for passthrough
> devices. Please bear a long explanation for how we reach this design
> instead of managing it solely in iommufd like what vfio does today.
>
> Devices which cannot be isolated from each other are organized into an
> iommu group. When a device is assigned to the user space, the entire
> group must be put in a security context so that user-initiated DMAs via
> the assigned device cannot harm the rest of the system. No user access
> should be granted on a device before the security context is established
> for the group which the device belongs to.

> Managing the security context must meet below criteria:
>
> 1) The group is viable for user-initiated DMAs. This implies that the
> devices in the group must be either bound to a device-passthrough

s/a/the same/

> framework, or driver-less, or bound to a driver which is known safe
> (not do DMA).
>
> 2) The security context should only allow DMA to the user's memory and
> devices in this group;
>
> 3) After the security context is established for the group, the group
> viability must be continuously monitored before the user relinquishes
> all devices belonging to the group. The viability might be broken e.g.
> when a driver-less device is later bound to a driver which does DMA.
>
> 4) The security context should not be destroyed before user access
> permission is withdrawn.
>
> Existing vfio introduces explicit container/group semantics in its uAPI
> to meet above requirements. A single security context (iommu domain)
> is created per container. Attaching group to container moves the entire
> group into the associated security context, and vice versa. The user can
> open the device only after group attach. A group can be detached only
> after all devices in the group are closed. Group viability is monitored
> by listening to iommu group events.
>
> Unlike vfio, iommufd adopts a device-centric design with all group
> logistics hidden behind the fd. Binding a device to iommufd serves
> as the contract to get security context established (and vice versa
> for unbinding). One additional requirement in iommufd is to manage the
> switch between multiple security contexts due to decoupled bind/attach:

This should be a precursor series that actually does clean things up
properly. There is no reason for vfio and iommufd to differ here, if
we are implementing this logic into the iommu layer then it should be
deleted from the VFIO layer, not left duplicated like this.

IIRC in VFIO the container is the IOAS and when the group goes to
create the device fd it should simply do the
iommu_device_init_user_dma() followed immediately by a call to bind
the container IOAS as your #3.

Then delete all the group viability stuff from vfio, relying on the
iommu to do it.

It should have full symmetry with the iommufd.

> @@ -1664,6 +1671,17 @@ static int iommu_bus_notifier(struct notifier_block *nb,
> group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
> break;
> case BUS_NOTIFY_BOUND_DRIVER:
> + /*
> + * FIXME: Alternatively the attached drivers could generically
> + * indicate to the iommu layer that they are safe for keeping
> + * the iommu group user viable by calling some function around
> + * probe(). We could eliminate this gross BUG_ON() by denying
> + * probe to non-iommu-safe driver.
> + */
> + mutex_lock(&group->mutex);
> + if (group->user_dma_owner_id)
> + BUG_ON(!iommu_group_user_dma_viable(group));
> + mutex_unlock(&group->mutex);

And the mini-series should fix this BUG_ON properly by interlocking
with the driver core to simply refuse to bind a driver under these
conditions instead of allowing userspace to crash the kernel.

That alone would be justification enough to merge this work.

> +
> +/*
> + * IOMMU core interfaces for iommufd.
> + */
> +
> +/*
> + * FIXME: We currently simply follow vifo policy to mantain the group's
> + * viability to user. Eventually, we should avoid below hard-coded list
> + * by letting drivers indicate to the iommu layer that they are safe for
> + * keeping the iommu group's user aviability.
> + */
> +static const char * const iommu_driver_allowed[] = {
> + "vfio-pci",
> + "pci-stub"
> +};

Yuk. This should be done with some callback in those drivers
'iomm_allow_user_dma()"

Ie the basic flow would see the driver core doing some:

ret = iommu_doing_kernel_dma()
if (ret) do not bind
driver_bind
pci_stub_probe()
iommu_allow_user_dma()

And the various functions are manipulating some atomic.
0 = nothing happening
1 = kernel DMA
2 = user DMA

No BUG_ON.

Jason