RE: [PATCH rc v2 0/5] iommu/arm-smmu-v3: Fix device crash on kdump kernel
From: Tian, Kevin
Date: Fri Apr 17 2026 - 03:55:10 EST
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Friday, April 17, 2026 1:20 AM
>
> On Thu, Apr 16, 2026 at 05:49:24PM +0100, Robin Murphy wrote:
> > On 15/04/2026 10:17 pm, Nicolin Chen wrote:
> > > When transitioning to a kdump kernel, the primary kernel might have
> crashed
> > > while endpoint devices were actively bus-mastering DMA. Currently, the
> SMMU
> > > driver aggressively resets the hardware during probe by clearing
> CR0_SMMUEN
> > > and setting the Global Bypass Attribute (GBPA) to ABORT.
> > >
> > > In a kdump scenario, this aggressive reset is highly destructive:
> > > a) If GBPA is set to ABORT, in-flight DMA will be aborted, generating fatal
> > > PCIe AER or SErrors that may panic the kdump kernel
> > > b) If GBPA is set to BYPASS, in-flight DMA targeting some IOVAs will
> bypass
> > > the SMMU and corrupt the physical memory at those 1:1 mapped
> IOVAs.
> >
> > But wasn't that rather the point? Th kdump kernel doesn't know the scope
> of
> > how much could have gone wrong (including potentially the SMMU
> configuration
> > itself), so it just blocks everything, resets and reenables the devices it
> > cares about, and ignores whatever else might be on fire.
>
> The purpose of kdump is to have the maximum chance to capture a dump
> from the blown up kernel.
>
> Yes, on a perfect platform aborting the entire SMMU should improve the
> chance of getting that dump.
>
> But sadly there are so many busted up platforms where if you start
> messing with the IOMMU they will explode and blow up the kdump. x86
> and "firmware first" error handling systems are particularly notorious
> for nasty behavior like this.
>
> Seems like there are now ARM systems too. :(
is there any report on such systems? It might be informational to include
a link to the report so it's clear that this series fixes real issues instead of
a preparation for coming systems...
>
> So, the iommu drivers have been preserving the IOMMU and not
> disrupting the DMAs on x86 for a long time. This is established kdump
> practice.
>
> > If AER can panic a kdump kernel, that seems like a failing of the kdump
> > kernel itself more than anything else (especially given the likelihood that
> > additional AER events could follow from whatever initial crash/failure
> > triggered kdump to begin with).
>
> Probably the kdump wasn't triggered by AER. You want kdump to not
> trigger more RAS events that might blow up the kdump while it is
> trying to run.. That increases the chance of success
>
btw the DMA is allowed after the previous kernel is hung til the point
where smmu driver blocks it. In cases where in-fly DMAs are considered
dangerous to kdump, this series just make it worse instead of creating
a new issue. While for majority other failures not related to DMAs,
unblocking then increases the chance of success...