RE: [PATCH rc v4] iommu: Fix nested pci_dev_reset_iommu_prepare/done()

From: Tian, Kevin

Date: Thu Apr 02 2026 - 03:41:18 EST


Hi, Nicolin,

> From: Nicolin Chen <nicolinc@xxxxxxxxxx>
> Sent: Tuesday, March 24, 2026 9:41 AM
>
> @@ -3961,20 +3986,23 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
>
> guard(mutex)(&group->mutex);
>
> - /* Re-entry is not allowed */
> - if (WARN_ON(group->resetting_domain))
> - return -EBUSY;
> + gdev = __dev_to_gdev(&pdev->dev);
> + if (WARN_ON(!gdev))
> + return -ENODEV;
> +
> + if (gdev->reset_depth++)
> + return 0;
>
> ret = __iommu_group_alloc_blocking_domain(group);
> if (ret)
> - return ret;
> + goto err_depth;
>
> /* Stage RID domain at blocking_domain while retaining group->domain */
> if (group->domain != group->blocking_domain) {
> ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
> group->domain);
> if (ret)
> - return ret;
> + goto err_depth;
> }
>
> /*
> @@ -3987,7 +4015,11 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
> iommu_remove_dev_pasid(&pdev->dev, pasid,
> pasid_array_entry_to_domain(entry));
>
> - group->resetting_domain = group->blocking_domain;
> + group->recovery_cnt++;
> + return ret;
> +
> +err_depth:
> + gdev->reset_depth--;
> return ret;
> }

I tried to understand how this change interacts with the use of
iommu_driver_get_domain_for_dev() in the arm smmu driver, but in the
end got confused. Better to have Jason take a look.

So far there are only two uses:

1) in arm_smmu_set_pasid():

- if cd is not enabled and sid_domain is not identity or blocked,
set_pasid() is disallowed, probably because enabling cd in the ste
could break in-flight DMAs if they are remapped

- otherwise, cd is enabled in the ste dynamically and S1DSS is set
according to identity or blocked

2) in arm_smmu_blocking_set_dev_pasid():

- if the last user of the cd table goes away and sid_domain is identity
or blocked, the ste is downgraded to a non-cd format by re-attaching
to the sid_domain

Now a device is being reset. No parallel attach_device or set_dev_pasid
requests are allowed outside of the iommu core itself, so the only
invocations are in pci_dev_reset_iommu_prepare() and
pci_dev_reset_iommu_done().

Let's assume the device being reset has pasid usage, i.e. the ste is in cd format.

In pci_dev_reset_iommu_prepare():

if (gdev->reset_depth++)
return 0;

// now iommu_driver_get_domain_for_dev() returns blocking domain

xa_for_each_start(&group->pasid_array, pasid, entry, 1)
iommu_remove_dev_pasid(&pdev->dev, pasid,
pasid_array_entry_to_domain(entry));

arm_smmu_blocking_set_dev_pasid() sees that sid_domain is blocking, so
the last iommu_remove_dev_pasid() downgrades the cd-format ste to a
non-cd one.

In pci_dev_reset_iommu_done():

if (gdev->reset_depth > 1) {
gdev->reset_depth--;
return;
}

// restore valid pasid domains

gdev->reset_depth = 0;

Upon the first pasid, arm_smmu_set_pasid() sees that cd is not enabled
and sid_domain is blocking, so it upgrades the non-cd ste to a cd one.

This is fine.

But then I wonder what the real value of iommu_driver_get_domain_for_dev()
is. Without it, the above flow just skips the cd downgrade/upgrade in
case group->domain is blocked or identity. No correctness issue.

But with it there might be corner cases to pay attention to, in case the
iommu driver calls it in other places. E.g. Sashiko [1] pointed out one
potential issue:

"
Incrementing gdev->reset_depth at the beginning of the function prematurely
exposes the blocking_domain to the driver. When __iommu_attach_device() is
called above, if the driver's attach_dev callback uses
iommu_driver_get_domain_for_dev() to query its current domain, it will observe
reset_depth == 1 and receive the blocking_domain even though it is not yet
attached to it.
Could reset_depth be set to 1 only after a successful attachment, while nested
calls still check for gdev->reset_depth > 0 at the beginning?
"

Separately, the code comment in iommu_driver_get_domain_for_dev() says:

/*
* Driver handles the low-level __iommu_attach_device(), including the
* one invoked by pci_dev_reset_iommu_done() re-attaching the device to
* the cached group->domain. In this case, the driver must get the old
* domain from group->resetting_domain rather than group->domain. This
* prevents it from re-attaching the device from group->domain (old) to
* group->domain (new).
*/

That comment is not related to set_dev_pasid() at all. Likely I missed
the real reason for adding it?

[1] https://sashiko.dev/#/patchset/20260324014056.36103-1-nicolinc%40nvidia.com