Re: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()

From: Baolu Lu
Date: Sun Dec 03 2023 - 03:58:17 EST


On 12/2/23 4:35 AM, Jason Gunthorpe wrote:
On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote:
The iopf_queue_flush_dev() is called by the iommu driver before releasing
a PASID. It ensures that all pending faults for this PASID have been
handled or cancelled, and won't hit the address space that reuses this
PASID. The driver must make sure that no new fault is added to the queue.
This needs more explanation, why should anyone care?

More importantly, why is *discarding* the right thing to do?
Especially why would we discard a partial page request group?

After we change a translation we may have PRI requests in a
queue. They need to be acknowledged, not discarded. The DMA in the
device should be restarted and the device should observe the new
translation - if it is blocking then it should take a DMA error.

More broadly, we should just let things run their normal course. The
domain to deliver the fault to should be determined very early. If we
get a fault and there is no fault domain currently assigned then just
restart it.

The main reason to fence would be to allow the domain to become freed
as the faults should be holding pointers to it. But I feel there are
simpler options for that than this..

In the iommu_detach_device_pasid() path, the domain is about to be
removed from the device's PASID. The IOMMU driver performs the
following steps sequentially:

1. Clears the pasid translation entry. Thus, all subsequent DMA
transactions (translation requests, translated requests or page
requests) targeting the iommu domain will be blocked.

2. Waits until all pending page requests for the device's PASID have
been reported to upper layers via the iommu_report_device_fault().
However, this does not guarantee that all page requests have been
responded to.

3. Frees all partial page requests for this pasid, since a page request
response is only needed for a complete request group. No action is
required for page requests that are not the last of a request
group.

4. Iterates through the list of pending page requests and identifies
those originating from the device's PASID. For each identified
request, the driver responds to the hardware with the
IOMMU_PAGE_RESP_INVALID code, indicating that the request cannot be
handled and retries should not be attempted. This response code
corresponds to the "Invalid Request" status defined in the PCI PRI
specification.

5. Follows the IOMMU hardware requirements (for example, VT-d spec,
section 7.10, Software Steps to Drain Page Requests & Responses) to
drain in-flight page requests and page group responses between the
remapping hardware queues and the endpoint device.

With the above steps done in iommu_detach_device_pasid(), the pasid can
be reused for any other address space.

The iopf_queue_discard_dev_pasid() helper performs steps 3 and 4.
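
To make steps 3 and 4 concrete, here is a rough user-space model of the
bookkeeping. It is only a sketch and not the code in the patch: every
model_* name, the two singly linked lists, and the invalid_sent counter
are hypothetical stand-ins for the kernel's partial/pending fault lists
and for the driver's page response callback.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued page request; not a kernel struct. */
struct model_fault {
	unsigned int pasid;
	unsigned int grpid;
	bool last;			/* last request of its group? */
	struct model_fault *next;
};

/* Hypothetical stand-in for the per-device fault state. */
struct model_queue {
	struct model_fault *partial;	/* non-last requests of incomplete groups */
	struct model_fault *pending;	/* reported groups awaiting a response */
	unsigned int invalid_sent;	/* "Invalid Request" responses issued */
};

/* Add a request to the head of one of the lists. */
void model_push(struct model_fault **head, unsigned int pasid,
		unsigned int grpid, bool last)
{
	struct model_fault *f = calloc(1, sizeof(*f));

	if (!f)
		abort();
	f->pasid = pasid;
	f->grpid = grpid;
	f->last = last;
	f->next = *head;
	*head = f;
}

/*
 * Unlink and free every entry on *head that belongs to pasid.
 * Returns the number of entries removed.
 */
unsigned int model_drop_pasid(struct model_fault **head, unsigned int pasid)
{
	unsigned int n = 0;

	while (*head) {
		struct model_fault *f = *head;

		if (f->pasid == pasid) {
			*head = f->next;
			free(f);
			n++;
		} else {
			head = &f->next;
		}
	}
	return n;
}

/*
 * Step 3: free partial requests without responding to them.
 * Step 4: answer each pending group for this pasid with an
 *         "Invalid Request" response (modelled as a counter) and free it.
 * Returns the number of responses issued.
 */
unsigned int model_discard_pasid(struct model_queue *q, unsigned int pasid)
{
	unsigned int responded;

	model_drop_pasid(&q->partial, pasid);		/* step 3 */
	responded = model_drop_pasid(&q->pending, pasid);	/* step 4 */
	q->invalid_sent += responded;
	return responded;
}
```

In this model, requests belonging to other PASIDs stay on both lists
untouched, which mirrors the intent that only the detached PASID's
faults are discarded.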


The SMMUv3 driver doesn't use it because it only implements the
Arm-specific stall fault model where DMA transactions are held in the SMMU
while waiting for the OS to handle iopf's. Since a device driver must
complete all DMA transactions before detaching domain, there are no
pending iopf's with the stall model. PRI support requires adding a call to
iopf_queue_flush_dev() after flushing the hardware page fault queue.
This explanation doesn't make much sense, from a device driver
perspective both PRI and stall cause the device to not complete DMAs.

The difference between stall and PRI is fairly small: stall causes an
internal bus to lock up while PRI does not.

-int iopf_queue_flush_dev(struct device *dev)
+int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid)
 {
 	struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
+	const struct iommu_ops *ops = dev_iommu_ops(dev);
+	struct iommu_page_response resp;
+	struct iopf_fault *iopf, *next;
+	int ret = 0;
 	if (!iopf_param)
 		return -ENODEV;
 	flush_workqueue(iopf_param->queue->wq);
+
A naked flush_workqueue like this is really suspicious, it needs a
comment explaining why the queue can't get more work queued at this
point.

I suppose the driver is expected to stop calling
iommu_report_device_fault() before calling this function, but that
doesn't seem like it is going to be possible. Drivers should be
implementing atomic replace for the PASID updates and in that case
there is no moment when it can say the HW will stop generating PRI.

Atomic domain replacement for a PASID is not currently implemented in
the core or driver. Even if atomic replacement were to be implemented,
it would be necessary to ensure that all translation requests,
translated requests, page requests and responses for the old domain are
drained before switching to the new domain. I am not sure whether the
existing iommu hardware architecture supports this functionality.

Best regards,
baolu