Re: [PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds

From: Shuai Xue

Date: Wed Mar 18 2026 - 04:04:26 EST




On 3/18/26 3:15 AM, Nicolin Chen wrote:
IOMMU drivers handle ATC cache maintenance. They may encounter ATC-related
errors (e.g., ATC invalidation request timeout), indicating that ATC cache
might have stale entries that can corrupt the memory. In this case, IOMMU
driver has no choice but to block the device's ATS function and wait for a
device recovery.

The pci_dev_reset_iommu_done() called at the end of a reset function could
serve as a reliable signal to the IOMMU subsystem that the physical device
cache is completely clean. However, the function is called unconditionally
even if the reset operation had actually failed, which would re-attach the
faulty device back to a normal translation domain. And this will leave the
system highly exposed, creating vulnerabilities for data corruption:
IOMMU blocks RID/ATS
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Block RID/ATS
__reset(); // Failed (ATC is still stale)
pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)

The simplest fix is to use pci_dev_reset_iommu_done() only on a successful
reset:
IOMMU blocks RID/ATS
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Block RID/ATS
if (!__reset())
pci_dev_reset_iommu_done(); // Unblock RID/ATS
else
// Keep the device blocked by IOMMU

However, this breaks the symmetric requirement of these reset APIs so that
we have to allow a re-entry to pass a second reset attempt:
IOMMU blocks RID/ATS
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Block RID/ATS
__reset(); // Failed (ATC is still stale)
// Keep the device blocked by IOMMU
Second reset:
pci_reset_function():
pci_dev_reset_iommu_prepare(); // Re-entry (!)

Update the function kdocs and all the existing callers to only unblock ATS
when the reset succeeds. Drop the WARN_ON in pci_dev_reset_iommu_prepare()
to allow re-entries.

Signed-off-by: Nicolin Chen <nicolinc@xxxxxxxxxx>
---
drivers/iommu/iommu.c | 16 +++++++++-----
drivers/pci/pci-acpi.c | 11 +++++++++-
drivers/pci/pci.c | 50 +++++++++++++++++++++++++++++++++++++-----
drivers/pci/quirks.c | 11 +++++++++-
4 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 35db517809540..40a15c9360bd1 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3938,8 +3938,10 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
* IOMMU activity while leaving the group->domain pointer intact. Later when the
* reset is finished, pci_dev_reset_iommu_done() can restore everything.
*
- * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to unset the resetting_domain.
+ * Caller must use pci_dev_reset_iommu_done() after a successful PCI-level reset
+ * to unset the resetting_domain. If the reset fails, caller can choose to keep
+ * the device in the resetting_domain to protect system memory using IOMMU from
+ * any bad ATS.

I think you mean ...to protect system memory from DMA via stale ATC entries.

*
* Return: 0 on success or negative error code if the preparation failed.
*
@@ -3961,9 +3963,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
guard(mutex)(&group->mutex);
- /* Re-entry is not allowed */
- if (WARN_ON(group->resetting_domain))
- return -EBUSY;
+ /* Already prepared */
+ if (group->resetting_domain)
+ return 0;

Could you elaborate more on why Re-entry is allowed now? For example:

/*
* Allow re-entry: if a previous reset failed, the device remains in
* resetting_domain. A subsequent reset attempt must be able to call
* prepare() again without failing.
*/

ret = __iommu_group_alloc_blocking_domain(group);
if (ret)
@@ -4001,7 +4003,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
* re-attaching all RID/PASID of the device's back to the domains retained in
* the core-level structure.
*
- * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
+ * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller should
+ * use it on a successful PCI-level reset. Otherwise, it's suggested for caller
+ * to keep the device in the resetting_domain to protect system memory.
*
* Note that, although unlikely, there is a risk that re-attaching domains might
* fail due to some unexpected happening like OOM.
diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 4d0f2cb6c695b..f1a918938242c 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -16,6 +16,7 @@
#include <linux/pci_hotplug.h>
#include <linux/module.h>
#include <linux/pci-acpi.h>
+#include <linux/pci-ats.h>
#include <linux/pci-ecam.h>
#include <linux/pm_runtime.h>
#include <linux/pm_qos.h>
@@ -977,7 +978,15 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
ret = -ENOTTY;
}
- pci_dev_reset_iommu_done(dev);
+ /*
+ * The reset might be invoked to recover a serious error. E.g. when the
+ * ATC failed to invalidate its stale entries, which can result in data
+ * corruption. Thus, do not unblock ATS until a successful reset.
+ */
+ if (!ret || !pci_ats_supported(dev))
+ pci_dev_reset_iommu_done(dev);

pci_dev_reset_iommu_prepare() already does an early return for non-ATS
devices, and pci_dev_reset_iommu_done() also early-returns
when !pci_ats_supported(). So for non-ATS devices, done()
is always a no-op regardless of whether it's called or not.

Do we really need check pci_ats_supported(dev) here?

+ else
+ pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
return ret;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8479c2e1f74f1..80c5cf6eeebdc 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4358,7 +4358,15 @@ int pcie_flr(struct pci_dev *dev)
ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
done:
- pci_dev_reset_iommu_done(dev);
+ /*
+ * The reset might be invoked to recover a serious error. E.g. when the
+ * ATC failed to invalidate its stale entries, which can result in data
+ * corruption. Thus, do not unblock ATS until a successful reset.
+ */
+ if (!ret || !pci_ats_supported(dev))
+ pci_dev_reset_iommu_done(dev);
+ else
+ pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
return ret;
}
EXPORT_SYMBOL_GPL(pcie_flr);
@@ -4436,7 +4444,15 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
done:
- pci_dev_reset_iommu_done(dev);
+ /*
+ * The reset might be invoked to recover a serious error. E.g. when the
+ * ATC failed to invalidate its stale entries, which can result in data
+ * corruption. Thus, do not unblock ATS until a successful reset.
+ */
+ if (!ret || !pci_ats_supported(dev))
+ pci_dev_reset_iommu_done(dev);
+ else
+ pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
return ret;
}
@@ -4490,7 +4506,15 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
pci_dev_d3_sleep(dev);
ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
- pci_dev_reset_iommu_done(dev);
+ /*
+ * The reset might be invoked to recover a serious error. E.g. when the
+ * ATC failed to invalidate its stale entries, which can result in data
+ * corruption. Thus, do not unblock ATS until a successful reset.
+ */
+ if (!ret || !pci_ats_supported(dev))
+ pci_dev_reset_iommu_done(dev);
+ else
+ pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
return ret;
}
@@ -4933,7 +4957,15 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
rc = pci_parent_bus_reset(dev, probe);
done:
- pci_dev_reset_iommu_done(dev);
+ /*
+ * The reset might be invoked to recover a serious error. E.g. when the
+ * ATC failed to invalidate its stale entries, which can result in data
+ * corruption. Thus, do not unblock ATS until a successful reset.
+ */
+ if (!rc || !pci_ats_supported(dev))
+ pci_dev_reset_iommu_done(dev);
+ else
+ pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
return rc;
}
@@ -4978,7 +5010,15 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
reg);
- pci_dev_reset_iommu_done(dev);
+ /*
+ * The reset might be invoked to recover a serious error. E.g. when the
+ * ATC failed to invalidate its stale entries, which can result in data
+ * corruption. Thus, do not unblock ATS until a successful reset.
+ */
+ if (!rc || !pci_ats_supported(dev))
+ pci_dev_reset_iommu_done(dev);
+ else
+ pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
return rc;
}

cxl_reset_bus_function() calls pci_reset_bus_function() which already
contains the conditional pci_dev_reset_iommu_done() logic. Then the
outer function calls done() again. The second call is a no-op because
resetting_domain has already been cleared, but this is confusing.

Additionally, cxl_reset_bus_function() does its own
pci_dev_reset_iommu_prepare(), while pci_reset_bus_function() also
calls prepare() internally. With the re-entry change in this patch,
the nested prepare() silently succeeds.

Is it a expected behavior?


Thanks.
Shuai