Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
From: Nicolin Chen
Date: Thu Mar 05 2026 - 16:16:31 EST
On Thu, Mar 05, 2026 at 11:39:11AM -0400, Jason Gunthorpe wrote:
> On Wed, Mar 04, 2026 at 09:21:42PM -0800, Nicolin Chen wrote:
> > + /*
> > + * ATC timeout indicates the device has stopped responding to coherence
> > + * protocol requests. The only safe recovery is a reset to flush stale
> > + * cached translations. Note that pci_reset_function() internally calls
> > + * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
> > + * if PCI-level reset fails.
> > + */
> > + if (!pci_reset_function(pdev)) {
> > + /*
> > + * If reset succeeds, set BME back. Otherwise, fence the system
> > + * from a faulty device, in which case user will have to replug
> > + * the device to invoke pci_set_master().
> > + */
> > + pci_dev_lock(pdev);
> > + pci_set_master(pdev);
> > + pci_dev_unlock(pdev);
> > + }
>
> I thought we talked about this, the iommu driver cannot just blindly
> issue a reset like this, the reset has to come from the actual device
> driver through the AERish mechanism. Otherwise the driver RAS is going
> to explode.
>
> The smmu driver should immediately block the STE (reject translated
> requests) to protect the system before resuming whatever command
> submission has encountered the error.
>
> You could delegate the STE change to the interrupted command
> submission to avoid doing it from an ISR, that makes a lot of sense
> because the submission thread is already operating a cmdq so it could
> stick in a STE invalidation command, possibly even in place of the
> failed ATC command.
You mean in arm_smmu_cmdq_issue_cmdlist() that issued the timed
out ATC command?
So my test case was to trigger a device fault followed by an ATC
command. But I found that the ATC command submission returned 0,
while only the ISR saw the error:
  CMDQ error (cons 0x03000003): ATC invalidate timeout
  arm_smmu_debugfs_atc_write: ATC_INV ret=0
It seems difficult to insert a CMDQ_OP_CFGI_STE in the submission
thread?
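To make the question concrete, here is a rough userspace model of one possible hand-off, not driver code. It assumes the ISR can only record that some ATC invalidation timed out, and that the submitter checks that record after its sync completes, then blocks the stream it was invalidating. Every mock_* name, the flag, and the stream-table array are hypothetical; the real arm_smmu_cmdq_issue_cmdlist() path and its locking are not modelled, and a single flag would be ambiguous with concurrent submitters.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_NUM_SIDS 8

/* Hypothetical stand-in for the SMMU state touched by this flow. */
struct mock_smmu {
	atomic_bool atc_timeout_seen;     /* set by the mock ISR */
	bool ste_blocked[MOCK_NUM_SIDS];  /* stand-in for the stream table */
	bool dev_hung[MOCK_NUM_SIDS];     /* test knob: device ignores ATC_INV */
};

static void mock_smmu_init(struct mock_smmu *smmu)
{
	atomic_init(&smmu->atc_timeout_seen, false);
	for (size_t i = 0; i < MOCK_NUM_SIDS; i++) {
		smmu->ste_blocked[i] = false;
		smmu->dev_hung[i] = false;
	}
}

/* Mock ISR: only records the error, no STE update in interrupt context. */
static void mock_cmdq_error_isr(struct mock_smmu *smmu)
{
	atomic_store(&smmu->atc_timeout_seen, true);
}

/*
 * Mock "hardware": consumes ATC_INV + SYNC. A hung device raises the
 * error interrupt, but the sync itself still completes, which is why
 * the submitter would otherwise see ret=0.
 */
static void mock_hw_run_atc_inv(struct mock_smmu *smmu, uint32_t sid)
{
	if (smmu->dev_hung[sid])
		mock_cmdq_error_isr(smmu);
}

/* Submission thread: issue the invalidation, then consume the ISR's flag. */
static int mock_issue_atc_inv(struct mock_smmu *smmu, uint32_t sid)
{
	assert(smmu != NULL && sid < MOCK_NUM_SIDS);

	mock_hw_run_atc_inv(smmu, sid);

	/* Test-and-clear, so that one timeout is handled exactly once. */
	if (!atomic_exchange(&smmu->atc_timeout_seen, false))
		return 0;

	/* Stand-in for: write a blocking STE, then CFGI_STE + SYNC. */
	smmu->ste_blocked[sid] = true;
	return -ETIMEDOUT;
}
```

In this model a healthy stream returns 0 and stays unblocked, while a stream whose dev_hung knob is set returns -ETIMEDOUT and ends up with ste_blocked set. Whether the real submission path can observe such a flag before it returns is exactly the open question.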
> I think I'd break this up into smaller steps, just focus on this STE
> mechanism at start and have any future attach callback fix the STE.
>
> Then we can talk about how to properly trigger the PCI RAS flow and so
> on.
OK.
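For that first step, a minimal sketch of the intended STE life cycle as I understand it: the timeout path blocks the STE, and a later attach installs a working one again. This is a userspace model under assumptions. struct mock_ste and the mock_ste_* helpers are hypothetical, and "blocked" is modelled here as rejecting both translated and untranslated requests, which may be stricter than what is wanted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified view of one stream table entry. */
struct mock_ste {
	bool translate;    /* untranslated DMA is translated via the domain */
	bool ats_enabled;  /* translated (ATS) requests are accepted */
	int domain_id;     /* -1 while no domain is attached */
};

/* Timeout path: fence the device off until something re-attaches it. */
static void mock_ste_block(struct mock_ste *ste)
{
	assert(ste != NULL);
	ste->translate = false;
	ste->ats_enabled = false;
	ste->domain_id = -1;
}

/* Attach path: a later attach rewrites the STE and so lifts the block. */
static void mock_ste_attach(struct mock_ste *ste, int domain_id, bool want_ats)
{
	assert(ste != NULL && domain_id >= 0);
	ste->translate = true;
	ste->ats_enabled = want_ats;
	ste->domain_id = domain_id;
}

/* Would the model accept a translated request from the device? */
static bool mock_ste_accepts_translated(const struct mock_ste *ste)
{
	return ste->translate && ste->ats_enabled;
}
```

The point of the split is that the timeout path only ever moves the STE to the blocked state, and recovery stays with whoever performs the next attach.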
Thanks
Nicolin