RE: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()

From: Tian, Kevin
Date: Mon Dec 04 2023 - 22:23:15 EST


> From: Jason Gunthorpe <jgg@xxxxxxxx>
> Sent: Tuesday, December 5, 2023 9:53 AM
>
> On Tue, Dec 05, 2023 at 01:32:26AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxx>
> > > Sent: Monday, December 4, 2023 9:25 PM
> > >
> > > On Mon, Dec 04, 2023 at 05:37:13AM +0000, Tian, Kevin wrote:
> > > > > From: Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx>
> > > > > Sent: Monday, December 4, 2023 9:33 AM
> > > > >
> > > > > On 12/3/23 10:14 PM, Jason Gunthorpe wrote:
> > > > > > On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:
> > > > > >> Even if atomic replacement were to be implemented,
> > > > > >> it would be necessary to ensure that all translation requests,
> > > > > >> translated requests, page requests and responses for the old
> domain
> > > are
> > > > > >> drained before switching to the new domain.
> > > > > >
> > > > > > Again, no it isn't required.
> > > > > >
> > > > > > Requests simply have to continue to be acked, it doesn't matter if
> > > > > > they are acked against the wrong domain because the device will
> simply
> > > > > > re-issue them..
> > > > >
> > > > > Ah! I start to get your point now.
> > > > >
> > > > > Even a page fault response is postponed to a new address space,
> which
> > > > > possibly be another address space or hardware blocking state, the
> > > > > hardware just retries.
> > > >
> > > > if blocking then the device shouldn't retry.
> > >
> > > It does retry.
> > >
> > > The device is waiting on a PRI, it gets back an completion. It issues
> > > a new ATS (this is the rety) and the new-domain responds back with a
> > > failure indication.
> >
> > I'm not sure that is the standard behavior defined by PCIe spec.
>
> > According to "10.4.2 Page Request Group Response Message", function's
> > response to Page Request failure is implementation specific.
> >
> > so a new ATS is optional and likely the device will instead abort the DMA
> > if PRI response already indicates a failure.
>
> I didn't said the PRI would fail, I said the ATS would fail with a
> non-present.
>
> It has to work this way or it is completely broken with respect to
> existing races in the mm side. Agents must retry non-present ATS
> answers until you get a present or a ATS failure.

My understanding of the sequence is like below:

<'D' for device, 'I' for IOMMU>

(D) send a ATS translation request
(I) respond translation result
(D) If success then sends DMA to the target page
otherwise send a PRI request
(I) raise an IOMMU interrupt allowing sw to fix the translation
(I) generate a PRI response to device
(D) if success then jump to the first step to retry
otherwise abort the current request

If mm changes the mapping after a success PRI response, mmu notifier
callback in iommu driver needs to wait for device tlb invalidation completion
which the device will order properly with outstanding DMA requests using
the old translation.

If you refer to the 'retry' after receiving a success PRI response, then yes.

but there is really no reason to retry upon a PRI response failure which
indicates that the faulting address is not a valid one which OS would like
to fix.

>
> > > Again, all racy. If a DMA is ongoing at the same instant things are
> > > changed there is no definitive way to say if it resolved before or
> > > after.
> > >
> > > The only thing we care about is that dmas that are completed before
> > > see the before translation and dmas that are started after see the
> > > after translation.
> > >
> > > DMAs that cross choose one at random.
> >
> > Yes that makes sense for replacement.
> >
> > But here we are talking about a draining requirement when disabling
> > a pasid entry, which is certainly not involved in replacement.
>
> It is the same argument, you are replacing a PTE that was non-present

s/non-present/present/

> with one that is failing/blocking - the result of a DMA that crosses
> this event can be either.

kind of

>
> > > > I don't think atomic replace is the main usage for this draining
> > > > requirement. Instead I'm more interested in the basic popular usage:
> > > > attach-detach-attach and not convinced that no draining is required
> > > > between iommu/device to avoid interference between activities
> > > > from old/new address space.
> > >
> > > Something like IDXD needs to halt DMAs on the PASID and flush all
> > > outstanding DMA to get to a state where the PASID is quiet from the
> > > device perspective. This is the only way to stop interference.
> >
> > why is it IDXD specific behavior? I suppose all devices need to quiesce
> > the outstanding DMAs when tearing down the binding between the
> > PASID and previous address space.
>
> Because it is so simple HW I assume this is why this code is being
> pushed here :)
>
> > but there are also terminal conditions e.g. when a workqueue is
> > reset after hang hence additional draining is required from the
> > iommu side to ensure all the outstanding page requests/responses
> > are properly handled.
>
> Then it should be coded as an explicit drain request from device when
> and where they need it.
>
> It should not be integrated into the iommu side because it is
> nonsensical. Devices expecting consistent behavior must stop DMA
> before changing translation, and if they need help to do it they must
> call APIs. Changing translation is not required after a so called
> "terminal event".
>
> > vt-d spec defines a draining process to cope with those terminal
> > conditions (see 7.9 Pending Page Request Handling on Terminal
> > Conditions). intel-iommu driver just implements it by default for
> > simplicity (one may consider providing explicit API for drivers to
> > call but not sure of the necessity if such terminal conditions
> > apply to most devices). anyway this is not a fast path.
>
> It is not "by default" it is in the wrong place. These terminal
> conditions are things like FLR. FLR has nothing to do with changing
> the translation. I can trigger FLR and keep the current translation
> and still would want to flush out all the PRIs before starting DMA
> again to avoid protocol confusion.
>
> An API is absolutely necessary. Confusing the cases that need draining
> with translation change is just not logically right.
>
> eg we do need to modify VFIO to do the drain on FLR like the spec
> explains!
>
> Draining has to be ordered correctly with whatever the device is
> doing. Drain needs to come after FLR, for instance. It needs to come
> after a work queue reset, because drain doesn't make any sense unless
> it is coupled with a DMA stop at the device.

Okay that makes sense. As Baolu and you already agreed let's separate
this fix out of this series.

The minor interesting aspect is how to document this requirement
clearly so drivers won't skip calling it when sva is enabled.

>
> Hacking a DMA stop by forcing a blocking translation is not logically
> correct, with wrong ordering the device may see unexpected translation
> failures which may trigger AERs or bad things..

where is such hack? though the current implementation of draining
is not clean, it's put inside pasid-disable-sequence instead of forcing
a blocking translation implicitly in iommu driver i.e. it's still the driver
making decision for what translation to be used...

>
> > another example might be stop marker. A device using stop marker
> > doesn't need to wait for outstanding page requests. According to PCIe
> > spec (10.4.1.2 Managing PASID Usage on PRG Requests) the device
> > simply marks outstanding page request as stale and sends a stop
> > marker message to the IOMMU. Page responses for those stale
> > requests are ignored. But presumably the iommu driver still needs
> > to drain those requests until the stop marker message in unbind
> > to avoid them incorrectly routed to a new address space in case the
> > PASID is rebound to another process immediately.
>
> Stop marker doesn't change anything, in all processing it just removes
> requests that have yet to complete. If a device is using stop then
> most likely the whole thing is racy and the OS simply has to be ready
> to handle stop at any time.
>

I'm not sure whether a device supporting stop marker may provide
certain abort-dma cmd to not wait. But guess such abort semantics
will be likely used in terminal condition too so the same argument
still hold. Presumably normal translation change should always use
a stop-wait semantics for outstanding DMAs. 😊