Re: Why is the ARM SMMU v1/v2 put into bypass mode on kexec?

From: Will Deacon
Date: Fri Mar 22 2024 - 12:06:57 EST


On Tue, Mar 19, 2024 at 02:14:26PM -0500, Tyler Hicks wrote:
> On 2024-03-19 15:47:56, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 12:57:52PM +0000, Robin Murphy wrote:
> > > Beyond properly quiescing and resetting the system back to a boot-time
> > > state, the outgoing kernel in a kexec can only really do things which affect
> > > itself. Sure, we *could* configure the SMMU to block all traffic and disable
> > > the interrupt to avoid getting stuck in a storm of faults on the way out,
> > > but what does that mean for the incoming kexec payload? That it can have the
> > > pleasure of discovering the SMMU, innocently enabling the interrupt and
> > > getting stuck in an unexpected storm of faults. Or perhaps just resetting
> > > the SMMU into a disabled state and thus still unwittingly allowing its
> > > memory to be corrupted by the previous kernel not supporting kexec properly.
> >
> > Right, it's hard to win if DMA-active devices weren't quiesced properly
> > by the outgoing kernel. Either the SMMU was left in abort (leading to the
> > problems you list above) or the SMMU is left in bypass (leading to possible
> > data corruption). Which is better?
>
> My thoughts are that a loud and obvious failure (via unidentified stream
> fault messages and/or a possible interrupt storm preventing the new
> kernel from booting) is favorable to silent and subtle data corruption
> of the target kernel.

Looking at the SMMUv3 spec, the architecture does actually allow hardware
to reset into an aborting state:

[GBPA.ABORT]
| Note: An implementation can reset this field to 1, in order to
| implement a default deny policy on reset.

so perhaps it's not that unreasonable. I just dread the flood of emails
I'll get because the SMMU driver is noisy due to missing ->shutdown()
callbacks elsewhere :/

> > The best solution is obviously to implement those missing ->shutdown()
> > callbacks.
>
> Completely agree here but it can be difficult to even identify that a
> missing ->shutdown hook is the root cause without code changes to put
> the SMMU into abort mode and sleep for a bit in the SMMU's ->shutdown
> hook.

Perhaps that's the thing to tackle first, then? If we make it easier for
folks to diagnose and fix the missing ->shutdown() callbacks, then going
into abort is much more reasonable,

Will