Re: NMI stack overflow during resume of PCIe bridge with CONFIG_HARDLOCKUP_DETECTOR=y

From: Jason Gunthorpe

Date: Tue Jan 13 2026 - 16:15:52 EST


On Tue, Jan 13, 2026 at 08:30:46PM +0100, Thomas Gleixner wrote:
> So gradually your machine just stalls on outstanding MMIO transactions
> w/o further notice... The NMI is just a red herring.

CPUs usualy have timeouts for these things and they return 0xFF back
for the timed out read. Beyond that "it depends" if any other RAS
indications are raised.

> You need to figure out why that MMIO access to that device's
> configuration space stalls as anything else is just subsequent
> damage.

Given this is a resume it seems likely the PCI routing inside the
bridge chip has been messed up somehow during the suspend/resume.

Possibily due to errata in the bridge, there are many weird bridge
errata :\

Jason