Re: [PATCH] iommu/arm-smmu: Use pm_runtime in fault handlers

From: Robin Murphy

Date: Wed Jan 28 2026 - 13:45:08 EST

[ +Pranjal as this might matter for v3 too... ]

On 28/01/2026 5:56 am, Prakash Gupta wrote:

On 1/27/2026 9:35 PM, Robin Murphy wrote:

On 2026-01-27 12:11 pm, Prakash Gupta wrote:

Commit d4a44f0750bb ("iommu/arm-smmu: Invoke pm_runtime across the driver")
enabled pm_runtime for the arm-smmu device. On systems where the SMMU
sits in a power domain, all register accesses must be done while the
device is runtime-resumed to avoid unclocked register reads and
potential NoC errors.

So far, this has not been an issue for most SMMU clients because
stall-on-fault is enabled by default. While a translation fault is
being handled, the SMMU stalls further translations for that context
bank, so the fault handler would not race with a powered-down
SMMU.

Adreno SMMU now disables stall-on-fault in the presence of fault
storms to avoid saturating SMMU resources and hanging the GMU. With
stall-on-fault disabled, the SMMU can generate faults while its power
domain may no longer be enabled, which makes unclocked accesses to
fault-status registers in the SMMU fault handlers possible.

At face value, that sounds wrong - how does an SMMU generate a fault,
or indeed do anything, when it's powered off? In principle it's
possible that the SMMU might signal an interrupt, and is _then_
suspended (with the interrupt line somehow remaining asserted, so
probably more clock-gated than completely powered down) before the
interrupt handler runs, but we rather assume that we're not going to
have an unhandled hardware IRQ hanging around for longer than the
autosuspend delay.

So, judging by the diff below, I guess what you really mean is that in
the case of a threaded context IRQ handler, it can take long enough
between handling the hardware IRQ and the thread actually running that
the SMMU may have suspended in between?

You are correct that the SMMU cannot generate a fault while powered
down. A more accurate description of the race condition is as follows:
When stall-on-fault is disabled, the faulting transaction is
terminated. This allows the master (the GPU) to complete its work, drop
its power vote for the SMMU, and allow the SMMU to suspend. However, the
SMMU fault handler may still be waiting to execute on the CPU.
If the SMMU suspends before the handler reads the fault registers, an
unclocked access occurs. This scenario is significantly more likely when
using threaded IRQs due to the scheduling latency involved. I will
update the next iteration to reflect this.
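
The sequence described above can be sketched as a small userspace model. This is only an illustration under simplifying assumptions: every name here (`toy_smmu`, `toy_client_get`, `toy_run_scenario`, and so on) is hypothetical and not part of the driver, the autosuspend delay is ignored, and stalling is modelled simply as "the client cannot finish and drop its vote until the handler has run".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_smmu {
	bool powered;        /* power domain on, registers clocked */
	int  power_votes;    /* votes held by clients such as the GPU */
	bool fault_pending;  /* fault latched, handler has not run yet */
	bool stall_on_fault; /* true: faulting transaction stalls until handled */
};

/* A client takes a power vote, which keeps the SMMU powered. */
static void toy_client_get(struct toy_smmu *s)
{
	s->power_votes++;
	s->powered = true;
}

/* A client drops its vote; with no votes left the SMMU suspends. */
static void toy_client_put(struct toy_smmu *s)
{
	assert(s->power_votes > 0);
	if (--s->power_votes == 0)
		s->powered = false;
}

/*
 * A translation fault is raised. Returns true when the transaction is
 * terminated (stall-on-fault disabled), i.e. the client is free to finish
 * its work and drop its vote before the handler has been scheduled.
 */
static bool toy_raise_fault(struct toy_smmu *s)
{
	assert(s->powered); /* a powered-down SMMU cannot fault */
	s->fault_pending = true;
	return !s->stall_on_fault;
}

/* The handler finally runs; returns false for an unclocked register read. */
static bool toy_handler_reads_clocked(struct toy_smmu *s)
{
	bool clocked = s->powered;

	s->fault_pending = false;
	return clocked;
}

/* Runs the whole sequence; true means the handler's read was clocked. */
static bool toy_run_scenario(bool stall_on_fault)
{
	struct toy_smmu s = { .stall_on_fault = stall_on_fault };
	bool clocked;

	toy_client_get(&s);
	if (toy_raise_fault(&s))
		toy_client_put(&s); /* terminated: GPU finishes, drops its vote */
	clocked = toy_handler_reads_clocked(&s); /* delayed handler runs now */
	if (s.power_votes > 0)
		toy_client_put(&s); /* stalled case: vote dropped only afterwards */
	return clocked;
}
```

In this model `toy_run_scenario(true)` reports a clocked read and `toy_run_scenario(false)` reports an unclocked one, which is the window the description above refers to.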

Guard the context and global fault handlers with arm_smmu_rpm_get() /
arm_smmu_rpm_put() so that all SMMU fault register accesses are done
with the SMMU powered.
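
As a rough illustration of the guard pattern, here is a userspace sketch with a plain usage counter standing in for runtime PM. The names `toy_dev`, `toy_rpm_get`, `toy_rpm_put` and `toy_fault_handler` are hypothetical; they only mimic the shape of the arm_smmu_rpm_get() / arm_smmu_rpm_put() calls and do not reproduce the driver's implementation, and the autosuspend delay is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: a usage counter standing in for runtime PM state. */
struct toy_dev {
	int  usage;   /* models the runtime PM usage count */
	bool powered; /* registers are clocked while true */
};

/* Take a reference and resume the device if it was suspended. */
static int toy_rpm_get(struct toy_dev *d)
{
	if (d->usage++ == 0)
		d->powered = true; /* a real synchronous resume may sleep */
	return 0;
}

/* Drop the reference; the device may suspend once nobody holds one. */
static void toy_rpm_put(struct toy_dev *d)
{
	assert(d->usage > 0);
	if (--d->usage == 0)
		d->powered = false; /* real code would autosuspend after a delay */
}

/*
 * Fault handler sketch. Return values (assumed for this model):
 *    1  fault handled, registers were clocked
 *    0  could not power the device, nothing handled
 *   -1  registers were read while unclocked
 */
static int toy_fault_handler(struct toy_dev *d, bool guarded)
{
	int ret;

	if (guarded && toy_rpm_get(d) < 0)
		return 0;

	/* Stand-in for reading the fault status registers. */
	ret = d->powered ? 1 : -1;

	if (guarded)
		toy_rpm_put(d);
	return ret;
}
```

Starting from a suspended device, the unguarded call returns -1 while the guarded call returns 1 and leaves the usage count balanced. Note that the sketch says nothing about whether the get may be called from the handler's context, which is the question raised later in the thread.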

Fixes: b13044092c1e ("drm/msm: Temporarily disable stall-on-fault after a page fault")
Co-developed-by: Pratyush Brahma <pratyush.brahma@xxxxxxxxxxxxxxxx>
Signed-off-by: Pratyush Brahma <pratyush.brahma@xxxxxxxxxxxxxxxx>
Signed-off-by: Prakash Gupta <prakash.gupta@xxxxxxxxxxxxxxxx>
---
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 5 ++-
drivers/iommu/arm/arm-smmu/arm-smmu.c      | 53 ++++++++++++++++++++++--------
2 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 573085349df3..2d03df72612d 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -317,6 +317,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
      struct arm_smmu_device *smmu = smmu_domain->smmu;
      struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
      const struct of_device_id *client_match;
+    const struct arm_smmu_impl *impl = qsmmu->data->impl;
      int cbndx = smmu_domain->cfg.cbndx;
      struct adreno_smmu_priv *priv;
@@ -350,10 +351,12 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
      priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
      priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
      priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
-    priv->set_stall = qcom_adreno_smmu_set_stall;
      priv->set_prr_bit = NULL;
      priv->set_prr_addr = NULL;
+    if (impl->context_fault_needs_threaded_irq)
+        priv->set_stall = qcom_adreno_smmu_set_stall;
+
      if (of_device_is_compatible(np, "qcom,smmu-500") &&
          !of_device_is_compatible(np, "qcom,sm8250-smmu-500") &&
          of_device_is_compatible(np, "qcom,adreno-smmu")) {
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 5e690cf85ec9..183f12e45b02 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -462,10 +462,23 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
      int idx = smmu_domain->cfg.cbndx;
      int ret;
+    if (smmu->impl && smmu->impl->context_fault_needs_threaded_irq) {

Why is this conditional on being threaded, if the global fault handler
that can never be threaded at all apparently needs it unconditionally?

Synchronous runtime PM calls can sleep, which would cause issues if
called within a hard IRQ context. This is why I added the conditional
check for threaded IRQs.
Furthermore, this change only allows the driver to override the
stall-on-fault setting when context_fault_needs_threaded_irq is true.
Since the unclocked access issue is tied to disabling stall-on-fault,
the fix is only logically required for the threaded IRQ path.
For the global fault handler, which runs in a hard IRQ context, you are
right: we cannot safely vote for power there. I will remove the runtime
PM call from that section.

Hmm, but then how *do* we actually guarantee that autosuspend doesn't
happen to kick in and power down the SMMU just as a hardirq handler
runs, when there's some unexpected event? I fear there's a horrible can
of worms here...

Thanks,
Robin.