Re: Plan for /dev/ioasid RFC v2

From: Jason Gunthorpe
Date: Thu Jun 17 2021 - 20:11:02 EST


On Tue, Jun 15, 2021 at 10:12:15AM -0600, Alex Williamson wrote:
>
> 1) A dual-function PCIe e1000e NIC where the functions are grouped
> together due to ACS isolation issues.
>
> a) Initial state: functions 0 & 1 are both bound to e1000e driver.
>
> b) Admin uses driverctl to bind function 1 to vfio-pci, creating
> vfio device file, which is chmod'd to grant to a user.
>
> c) User opens vfio function 1 device file and an iommu_fd, binds
> device_fd to iommu_fd.
>
> Does this succeed?
> - if no, specifically where does it fail?

No, the e1000e driver is still connected to the device.

It fails during the VFIO_BIND_IOASID_FD call because the iommu common
code checks the group membership for consistency.

We detect it basically the same way things work today, just moved to
the iommu code.

> d) Repeat b) for function 0.
> e) Repeat c), still using function 1, is it different? Where? Why?

Succeeds because all group device members are now bound to vfio

It is hard to predict the nicest way to do all of this, but I would
start by imagining that iommu_fd-using drivers (like vfio) call some
kind of iommu_fd_allow_dma_blocking() during their probe(), which
organizes the machinery to drive this.
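Not kernel code, but the bind-time group rule above can be modeled in
a few lines. Everything here is illustrative - the dict fields and the
dma_block_ok flag are stand-ins for whatever the real
iommu_fd_allow_dma_blocking() opt-in ends up looking like:

```python
# Toy model of the check VFIO_BIND_IOASID_FD would do: a device in the
# group is acceptable if it has no kernel driver bound, or its driver
# opted in to DMA blocking during probe() (as vfio-pci would).
def group_bind_ok(group):
    return all(dev["driver"] is None or dev["dma_block_ok"]
               for dev in group)

# Scenario 1c: function 0 still bound to e1000e -> bind fails.
group = [
    {"name": "fn0", "driver": "e1000e",   "dma_block_ok": False},
    {"name": "fn1", "driver": "vfio-pci", "dma_block_ok": True},
]
assert not group_bind_ok(group)

# Scenario 1e: both functions now bound to vfio-pci -> bind succeeds.
group[0] = {"name": "fn0", "driver": "vfio-pci", "dma_block_ok": True}
assert group_bind_ok(group)
```

The point being that the check is purely per-group and per-driver
state; vfio itself doesn't need to know about the other group members.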

> 2) The same NIC as 1)
>
> a) Initial state: functions 0 & 1 bound to vfio-pci, vfio device
> files granted to user, user has bound both device_fds to the same
> iommu_fd.
>
> AIUI, even though not bound to an IOASID, vfio can now enable access
> through the device_fds, right?

Yes

> What specific entity has placed these
> devices into a block DMA state, when, and how?

To keep all the semantics the same it must be done as part of
VFIO_BIND_IOASID_FD.

This will have to go over every device in the group and put it in the
dma blocked state. Riffing on the above this is possible if there is
no attached device driver, or the device driver that is attached has
called iommu_fd_allow_dma_blocking() during its probe()

I haven't gone through all of Kevin's notes about how this could be
sorted out directly in the iommu code though..

> b) Both devices are attached to the same IOASID.
>
> Are we assuming that each device was atomically moved to the new
> IOMMU context by the IOASID code? What if the IOMMU cannot change
> the domain atomically?

What does "atomically" mean here? I assume all IOMMU HW can
change IOASIDs without accidentally leaking traffic
through.

Otherwise that is a major design restriction..

> c) The device_fd for function 1 is detached from the IOASID.
>
> Are we assuming the reverse of b) performed by the IOASID code?

Yes, the IOMMU will change from the active IOASID to the "block DMA"
ioasid in a way that is secure.
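One way to picture the attach/detach semantics above is as a
per-device state machine that is never in an unprotected state. The
names below are invented for illustration; BLOCK_DMA stands in for the
"block DMA" ioasid:

```python
BLOCK_DMA = "block-dma-ioasid"

class Device:
    """Toy model: a device bound to an iommu_fd always has *some*
    IOASID; it starts at, and detaches back to, the block-DMA one."""
    def __init__(self):
        self.ioasid = BLOCK_DMA   # bind leaves the device DMA-blocked

    def attach(self, ioasid):
        # Single transition: blocked -> user IOASID. The HW is assumed
        # to switch translation without a window of unisolated DMA.
        self.ioasid = ioasid

    def detach(self):
        # The reverse of attach: back to the block-DMA ioasid, again
        # with no unprotected window.
        self.ioasid = BLOCK_DMA

dev = Device()
dev.attach("user-ioasid-1")   # step 2b
assert dev.ioasid == "user-ioasid-1"
dev.detach()                  # step 2c
assert dev.ioasid == BLOCK_DMA
```

There is never a state where the device has no IOASID at all, which is
the "atomically" property being assumed of the IOMMU HW.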

> d) The device_fd for function 1 is unbound from the iommu_fd.
>
> Does this succeed?

Yes

> - if yes, what is the resulting IOMMU context of the device and
> who owns it?

device_fd for function 1 remains set to the "block DMA"
ioasid.

Attempting to attach a kernel driver triggers a BUG_ON, as today

Attempting to open it again and use it with a different iommu_fd fails

> e) Function 1 is unbound from vfio-pci.
>
> Does this work or is it blocked? If blocked, by what entity
> specifically?

As today, it is allowed. The device would have to remain attached to
the "block all DMA" ioasid until the implicit connection to the group
in the iommu_fd is released.

> f) Function 1 is bound to e1000e driver.

As today, a BUG_ON is triggered via the same maze of notifiers (gross,
but where we are for now). The notifiers would be driven by the
iommu_fd instead of vfio.
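The ownership rules in d)-f) amount to: the group belongs to the
iommu_fd until the last reference goes away. A toy model (all names
invented, and the BUG_ON modeled as an exception):

```python
class Group:
    """Toy model of group ownership by an iommu_fd."""
    def __init__(self):
        self.owner = None          # iommu_fd currently owning the group

    def claim(self, iommu_fd):
        # A second iommu_fd cannot take the group while it is owned.
        if self.owner is not None and self.owner != iommu_fd:
            raise PermissionError("group owned by another iommu_fd")
        self.owner = iommu_fd

    def bind_kernel_driver(self):
        # Stand-in for today's BUG_ON: binding e1000e back while
        # userspace still owns the group is a fatal error.
        if self.owner is not None:
            raise RuntimeError("BUG: kernel driver bind on owned group")

    def release(self):
        self.owner = None

g = Group()
g.claim("iommu_fd_A")
try:
    g.claim("iommu_fd_B")      # reopening with another iommu_fd fails
except PermissionError:
    pass
g.release()                     # implicit group connection released
g.bind_kernel_driver()          # now fine, as with e1000e today
```

So vfio-pci unbind is allowed at any point; only the group ownership
controls what happens next.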

> 3) A dual-function conventional PCI e1000 NIC where the functions are
> grouped together due to shared RID.

This operates effectively the same as today. Manipulating a device
implicitly manipulates the group. Instead of moving to the DMA-blocked
state, the devices track the IOASID the group is using.

We model it by demanding that all devices attach to the same IOASID;
instead of the DMA block step the device remains attached to the
group's IOASID. Today this is such an uncommon configuration (a PCI
bridge!) that we shouldn't design the entire API around it.
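For the shared-RID case the invariant is simply "one IOASID per
group"; a sketch of that rule (again just a model, not a proposed
interface):

```python
def attach(group_ioasids, dev, ioasid):
    """Toy model of the shared-RID rule: every device in the group
    must attach to the same IOASID, since the IOMMU cannot
    distinguish their traffic."""
    current = group_ioasids.get(dev["group"])
    if current is not None and current != ioasid:
        raise ValueError("shared-RID group already uses another IOASID")
    group_ioasids[dev["group"]] = ioasid

state = {}
attach(state, {"group": 7}, "ioasid-A")     # fn0
attach(state, {"group": 7}, "ioasid-A")     # fn1, same IOASID: OK
try:
    attach(state, {"group": 7}, "ioasid-B") # different IOASID: rejected
except ValueError:
    pass
```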

> If vfio gets to offload all of it's group management to IOASID code,
> that's great, but I'm afraid that IOASID is so focused on a
> device-level API that we're instead just ignoring the group dynamics
> and vfio will be forced to provide oversight to maintain secure
> userspace access.

I think it would be a major design failure if VFIO is required to
provide additional security on top of the iommu code. This is
basically the refactoring exercise - to move the VFIO code that is
only about iommu concerns to the iommu layer and VFIO becomes thinner.

Otherwise we still can't properly share this code - why should VDPA
and VFIO have different isolation models? Is it just because we expect
that everything except VFIO has 1:1 groups or no groups at all? Feels
wonky.

Jason