Re: Why is the ARM SMMU v1/v2 put into bypass mode on kexec?

From: Tyler Hicks
Date: Tue Mar 19 2024 - 15:15:04 EST


On 2024-03-19 15:47:56, Will Deacon wrote:
> On Tue, Mar 19, 2024 at 12:57:52PM +0000, Robin Murphy wrote:
> > Beyond properly quiescing and resetting the system back to a boot-time
> > state, the outgoing kernel in a kexec can only really do things which affect
> > itself. Sure, we *could* configure the SMMU to block all traffic and disable
> > the interrupt to avoid getting stuck in a storm of faults on the way out,
> > but what does that mean for the incoming kexec payload? That it can have the
> > pleasure of discovering the SMMU, innocently enabling the interrupt and
> > getting stuck in an unexpected storm of faults. Or perhaps just resetting
> > the SMMU into a disabled state and thus still unwittingly allowing its
> > memory to be corrupted by the previous kernel not supporting kexec properly.
>
> Right, it's hard to win if DMA-active devices weren't quiesced properly
> by the outgoing kernel. Either the SMMU was left in abort (leading to the
> problems you list above) or the SMMU is left in bypass (leading to possible
> data corruption). Which is better?

My view is that a loud and obvious failure (unidentified stream fault
messages and/or a possible interrupt storm that prevents the new kernel
from booting) is preferable to silent and subtle data corruption of the
target kernel.

> The best solution is obviously to implement those missing ->shutdown()
> callbacks.

Completely agree here, but it can be difficult to even identify that a
missing ->shutdown() callback is the root cause without code changes
that put the SMMU into abort mode and sleep for a bit in the SMMU's own
->shutdown() callback.
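
A rough userspace sketch of the register side of that kind of debug
change, for illustration only. It assumes the sCR0 bit layout used by the
Linux arm-smmu driver (CLIENTPD is bit 0, USFCFG is bit 10). The helper
names are hypothetical, and it only models unmatched streams; it does not
touch the stream mapping entries that a real change would also have to
deal with.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sCR0 bits, following the Linux arm-smmu driver definitions. */
#define SCR0_CLIENTPD (UINT32_C(1) << 0)   /* 1: client port disabled, all traffic bypasses */
#define SCR0_USFCFG   (UINT32_C(1) << 10)  /* 1: unidentified streams raise a fault */

/* Hypothetical predicate: would a transaction that matches no stream
 * mapping pass through untranslated with this sCR0 value? */
static int scr0_unmatched_bypasses(uint32_t scr0)
{
    return (scr0 & SCR0_CLIENTPD) || !(scr0 & SCR0_USFCFG);
}

/* Hypothetical helper: compute an sCR0 value that aborts unmatched
 * streams instead of letting them bypass. Other bits are preserved. */
static uint32_t scr0_abort_unmatched(uint32_t scr0)
{
    scr0 &= ~SCR0_CLIENTPD;  /* keep the client port enabled so traffic is checked */
    scr0 |= SCR0_USFCFG;     /* fault anything that matches no stream mapping */
    assert(!scr0_unmatched_bypasses(scr0));
    return scr0;
}
```

In the driver, the equivalent value would be written to sCR0 from the
shutdown path and followed by a short delay (msleep() or similar), so that
any device still doing DMA has time to show up as an unidentified stream
fault before the new kernel takes over.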

Tyler