Re: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()

From: Baolu Lu
Date: Sun Dec 03 2023 - 20:38:21 EST

Next message: NeilBrown: "[PATCH 0/2 v2] Move all file-close work for nfsd into nfsd threads"
Previous message: NeilBrown: "Re: [PATCH/RFC] core/nfsd: allow kernel threads to use task_work."
In reply to: Jason Gunthorpe: "Re: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()"
Next in thread: Tian, Kevin: "RE: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 12/3/23 10:14 PM, Jason Gunthorpe wrote:

On Sun, Dec 03, 2023 at 04:53:08PM +0800, Baolu Lu wrote:

On 12/2/23 4:35 AM, Jason Gunthorpe wrote:

On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote:

The iopf_queue_flush_dev() is called by the iommu driver before releasing
a PASID. It ensures that all pending faults for this PASID have been
handled or cancelled, and won't hit the address space that reuses this
PASID. The driver must make sure that no new fault is added to the queue.

This needs more explanation, why should anyone care?

More importantly, why is*discarding* the right thing to do?
Especially why would we discard a partial page request group?

After we change a translation we may have PRI requests in a
queue. They need to be acknowledged, not discarded. The DMA in the
device should be restarted and the device should observe the new
translation - if it is blocking then it should take a DMA error.

More broadly, we should just let things run their normal course. The
domain to deliver the fault to should be determined very early. If we
get a fault and there is no fault domain currently assigned then just
restart it.

The main reason to fence would be to allow the domain to become freed
as the faults should be holding pointers to it. But I feel there are
simpler options for that then this..

In the iommu_detach_device_pasid() path, the domain is about to be
removed from the pasid of device. The IOMMU driver performs the
following steps sequentially:

I know that is why it does, but it doesn't explain at all why.

1. Clears the pasid translation entry. Thus, all subsequent DMA
transactions (translation requests, translated requests or page
requests) targeting the iommu domain will be blocked.

2. Waits until all pending page requests for the device's PASID have
been reported to upper layers via the iommu_report_device_fault().
However, this does not guarantee that all page requests have been
responded.

3. Free all partial page requests for this pasid since the page request
response is only needed for a complete request group. There's no
action required for the page requests which are not last of a request
group.

But we expect the last to come eventually since everything should be
grouped properly, so why bother doing this?

Indeed if 2 worked, how is this even possible to have partials?

Step 1 clears the pasid table entry, hence all subsequent page requests
are blocked (hardware auto-respond the request but not put it in the
queue).

It is possible that a portion of a page fault group may have been queued
for processing, but the last request is being blocked by hardware due
to the pasid entry being in the blocking state.

In reality, this may be a no-op as I haven't seen any real-world
implementations of multiple-requests fault groups on Intel platforms.

5. Follow the IOMMU hardware requirements (for example, VT-d sepc,
section 7.10, Software Steps to Drain Page Requests & Responses) to
drain in-flight page requests and page group responses between the
remapping hardware queues and the endpoint device.

With above steps done in iommu_detach_device_pasid(), the pasid could be
re-used for any other address space.

As I said, that isn't even required. There is no issue with leaking
PRI's across attachments.

I suppose the driver is expected to stop calling
iommu_report_device_fault() before calling this function, but that
doesn't seem like it is going to be possible. Drivers should be
implementing atomic replace for the PASID updates and in that case
there is no momement when it can say the HW will stop generating PRI.

Atomic domain replacement for a PASID is not currently implemented in
the core or driver.

It is, the driver should implement set_dev_pasid in such a way that
repeated calls do replacements, ideally atomically. This is what ARM
SMMUv3 does after my changes.

Even if atomic replacement were to be implemented,
it would be necessary to ensure that all translation requests,
translated requests, page requests and responses for the old domain are
drained before switching to the new domain.

Again, no it isn't required.

Requests simply have to continue to be acked, it doesn't matter if
they are acked against the wrong domain because the device will simply
re-issue them..

Ah! I start to get your point now.

Even a page fault response is postponed to a new address space, which
possibly be another address space or hardware blocking state, the
hardware just retries.

As long as we flushes all caches (IOTLB and device TLB) during switching, the mappings of the old domain won't leak. So it's safe to keep page requests there.

Do I get you correctly?

Best regards,
baolu

Next message: NeilBrown: "[PATCH 0/2 v2] Move all file-close work for nfsd into nfsd threads"
Previous message: NeilBrown: "Re: [PATCH/RFC] core/nfsd: allow kernel threads to use task_work."
In reply to: Jason Gunthorpe: "Re: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()"
Next in thread: Tian, Kevin: "RE: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]