Re: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y
From: Jason Gunthorpe
Date: Tue Jan 13 2026 - 16:15:52 EST
On Tue, Jan 13, 2026 at 08:30:46PM +0100, Thomas Gleixner wrote:
> So gradually your machine just stalls on outstanding MMIO transactions
> w/o further notice... The NMI is just a red herring.
CPUs usualy have timeouts for these things and they return 0xFF back
for the timed out read. Beyond that "it depends" if any other RAS
indications are raised.
> You need to figure out why that MMIO access to that device's
> configuration space stalls as anything else is just subsequent
> damage.
Given this is a resume it seems likely the PCI routing inside the
bridge chip has been messed up somehow during the suspend/resume.
Possibily due to errata in the bridge, there are many weird bridge
errata :\
Jason