Re: [PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts
From: Pranjal Shrivastava
Date: Tue Mar 10 2026 - 16:05:08 EST
On Tue, Mar 10, 2026 at 12:57:30PM -0700, Nicolin Chen wrote:
> On Tue, Mar 10, 2026 at 07:40:56PM +0000, Pranjal Shrivastava wrote:
> > On Fri, Mar 06, 2026 at 09:02:02AM -0400, Jason Gunthorpe wrote:
> > > On Thu, Mar 05, 2026 at 09:06:17PM -0800, Nicolin Chen wrote:
> > > > On Thu, Mar 05, 2026 at 09:33:47PM -0400, Jason Gunthorpe wrote:
> > > > > On Thu, Mar 05, 2026 at 05:29:22PM -0800, Nicolin Chen wrote:
> > > > >
> > > > > > But arm_smmu_cmdq_issue_cmdlist() doesn't know when to push another
> > > > > > CMD. In my case where ATC_INV irq occurs, the return value from the
> > > > > > arm_smmu_cmdq_poll_until_sync() in the Step 5 is 0, and prods/cons
> > > > > > are also matched. Actually, at this point that NOP ISR has already
> > > > > > finished.
> > > > >
> > > > > > Yes, you'd need a sneaky way to convey the error from the ISR to the
> > > > > cmdlist code that didn't harm performance. Maybe we could come up with
> > > > > something, but if it works replacing the NOP with flush sounds fairly
> > > > > appealing - though can you do a single WORD edit to the STE that will
> > > > > block translated requests? Zero EATS?
> > > >
> > > > Yea. I can give that a try.
> > >
> > > This also really needs to go after the invalidation changes because it
> > > is feasible to also edit the lockless RCU invalidation list from the
> > > ISR and disable the ATC for the failed device too.
> > >
> > > > > Also, will the SMMU start spamming with blocked translation events or
> > > > > something that will need suppression too?
> > > >
> > > > CD.R=0 can suppress fault records, but we would need to override
> > > > that in every CD of the device.
> > >
> > > That's too much to do from ISR, but maybe we can do it from a WQ..
> > >
> >
> > (Skimming through these, apologies if I'm losing context), shouldn't we
> > do all that (marking it as an inv STE / abort STE, suppressing the
> > faults) in the worker instead of trying to reset/recover the device?
>
> EATS should be unset asap to avoid memory corruption. It's best
> done in the unmap() context, where the page hasn't been reclaimed
> yet by the kernel.
>

Makes sense.

> Worker thread will be a bit late, but it is good enough for any
> further step.
>

Ack.

Praan