Re: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()

From: Jason Gunthorpe
Date: Mon Dec 04 2023 - 20:53:14 EST


On Tue, Dec 05, 2023 at 01:32:26AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@xxxxxxxx>
> > Sent: Monday, December 4, 2023 9:25 PM
> >
> > On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote:
> > > > From: Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx>
> > > > Sent: Monday, December 4, 2023 9:33 AM
> > > >
> > > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> > > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
> > > > >> Even if atomic replacement were to be implemented,
> > > > >> it would be necessary to ensure that all translation requests,
> > > > >> translated requests, page requests and responses for the old domain
> > are
> > > > >> drained before switching to the new domain.
> > > > >
> > > > > Again, no it isn't required.
> > > > >
> > > > > Requests simply have to continue to be acked, it doesn't matter if
> > > > > they are acked against the wrong domain because the device will simply
> > > > > re-issue them..
> > > >
> > > > Ah! I start to get your point now.
> > > >
> > > > Even a page fault response is postponed to a new address space, which
> > > > possibly be another address space or hardware blocking state, the
> > > > hardware just retries.
> > >
> > > if blocking then the device shouldn't retry.
> >
> > It does retry.
> >
> > The device is waiting on a PRI, it gets back an completion. It issues
> > a new ATS (this is the rety) and the new-domain responds back with a
> > failure indication.
>
> I'm not sure that is the standard behavior defined by PCIe spec.

> According to "10.4.2 Page Request Group Response Message", function's
> response to Page Request failure is implementation specific.
>
> so a new ATS is optional and likely the device will instead abort the DMA
> if PRI response already indicates a failure.

I didn't said the PRI would fail, I said the ATS would fail with a
non-present.

It has to work this way or it is completely broken with respect to
existing races in the mm side. Agents must retry non-present ATS
answers until you get a present or a ATS failure.

> > Again, all racy. If a DMA is ongoing at the same instant things are
> > changed there is no definitive way to say if it resolved before or
> > after.
> >
> > The only thing we care about is that dmas that are completed before
> > see the before translation and dmas that are started after see the
> > after translation.
> >
> > DMAs that cross choose one at random.
>
> Yes that makes sense for replacement.
>
> But here we are talking about a draining requirement when disabling
> a pasid entry, which is certainly not involved in replacement.

It is the same argument, you are replacing a PTE that was non-present
with one that is failing/blocking - the result of a DMA that crosses
this event can be either.

> > > I don't think atomic replace is the main usage for this draining
> > > requirement. Instead I'm more interested in the basic popular usage:
> > > attach-detach-attach and not convinced that no draining is required
> > > between iommu/device to avoid interference between activities
> > > from old/new address space.
> >
> > Something like IDXD needs to halt DMAs on the PASID and flush all
> > outstanding DMA to get to a state where the PASID is quiet from the
> > device perspective. This is the only way to stop interference.
>
> why is it IDXD specific behavior? I suppose all devices need to quiesce
> the outstanding DMAs when tearing down the binding between the
> PASID and previous address space.

Because it is so simple HW I assume this is why this code is being
pushed here :)

> but there are also terminal conditions e.g. when a workqueue is
> reset after hang hence additional draining is required from the
> iommu side to ensure all the outstanding page requests/responses
> are properly handled.

Then it should be coded as an explicit drain request from device when
and where they need it.

It should not be integrated into the iommu side because it is
nonsensical. Devices expecting consistent behavior must stop DMA
before changing translation, and if they need help to do it they must
call APIs. Changing translation is not required after a so called
"terminal event".

> vt-d spec defines a draining process to cope with those terminal
> conditions (see 7.9 Pending Page Request Handling on Terminal
> Conditions). intel-iommu driver just implements it by default for
> simplicity (one may consider providing explicit API for drivers to
> call but not sure of the necessity if such terminal conditions
> apply to most devices). anyway this is not a fast path.

It is not "by default" it is in the wrong place. These terminal
conditions are things like FLR. FLR has nothing to do with changing
the translation. I can trigger FLR and keep the current translation
and still would want to flush out all the PRIs before starting DMA
again to avoid protocol confusion.

An API is absolutely necessary. Confusing the cases that need draining
with translation change is just not logically right.

eg we do need to modify VFIO to do the drain on FLR like the spec
explains!

Draining has to be ordered correctly with whatever the device is
doing. Drain needs to come after FLR, for instance. It needs to come
after a work queue reset, because drain doesn't make any sense unless
it is coupled with a DMA stop at the device.

Hacking a DMA stop by forcing a blocking translation is not logically
correct, with wrong ordering the device may see unexpected translation
failures which may trigger AERs or bad things..

> another example might be stop marker. A device using stop marker
> doesn't need to wait for outstanding page requests. According to PCIe
> spec (10.4.1.2 Managing PASID Usage on PRG Requests) the device
> simply marks outstanding page request as stale and sends a stop
> marker message to the IOMMU. Page responses for those stale
> requests are ignored. But presumably the iommu driver still needs
> to drain those requests until the stop marker message in unbind
> to avoid them incorrectly routed to a new address space in case the
> PASID is rebound to another process immediately.

Stop marker doesn't change anything, in all processing it just removes
requests that have yet to complete. If a device is using stop then
most likely the whole thing is racy and the OS simply has to be ready
to handle stop at any time.

Jason