Re: crash during resume of PCIe bridge from v5.17 to next-20260130 (v5.16 works)
From: Rafael J. Wysocki
Date: Sun Feb 01 2026 - 06:42:50 EST
On Sun, Feb 1, 2026 at 11:20 AM Armin Wolf <W_Armin@xxxxxx> wrote:
>
> On 01.02.26 at 01:36, Bert Karwatzki wrote:
>
> > I found the error: the commit
> > ("drm/amd: Check if ASPM is enabled from PCIe subsystem")
> > has been applied twice, first as cba07cce39ac and a second time
> > as 7294863a6f01, after it had been superseded by commit
> > 0ab5d711ec74 ("drm/amd: Refactor `amdgpu_aspm` to be evaluated per device").
> > This effectively disables ASPM globally after the built-in GPU (which does not
> > support ASPM) is probed. This is the reason for the crashes and device-loss
> > errors, which on average occur after ~1000 resumes of the discrete GPU.
> >
> > snippet from git log --oneline drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c in linux-next:
> > 158a05a0b885 drm/amdgpu: Add use_xgmi_p2p module parameter
> > 7294863a6f01 drm/amd: Check if ASPM is enabled from PCIe subsystem <--- This does not belong here!
> > b784f42cf78b drm/amdgpu: drop testing module parameter
> > 0b1a63487b0f drm/amdgpu: drop benchmark module parameter
> > cec2cc7b1c4a drm/amdgpu: Fix typo in *whether* in comment
> > 0ab5d711ec74 drm/amd: Refactor `amdgpu_aspm` to be evaluated per device <--- This removes the code from the previous commit.
> > cba07cce39ac drm/amd: Check if ASPM is enabled from PCIe subsystem <--- The first time the commit was applied.
> > dfcc3e8c24cc drm/amdgpu: make cyan skillfish support code more consistent
> >
> > The fix is simply to revert commit 7294863a6f01.
> >
> > I sent a patch for linux-next (unfortunately without CC'ing stable) and a separate patch for
> > v6.18.8, I hope this does not cause confusion ...
> >
> > Bert Karwatzki
>
> Good work! Thank you for researching the faulty commit that led to this strange behavior.
Yes, nice work, thanks!
I wish all of the reporters of kernel issues were so persistent.
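
A double application like the one in the log snippet quoted above can be
spotted mechanically by looking for subject lines that occur more than once
in `git log --oneline` output. The following is only a minimal sketch in
Python: the helper name find_duplicate_subjects is hypothetical, and the
sample text is the quoted log with the "<---" annotations removed. A repeated
subject is only a hint, since reverts, re-lands and backports can legitimately
share a subject, so each hit still needs to be checked by hand.

```python
from collections import defaultdict


def find_duplicate_subjects(oneline_log: str) -> dict[str, list[str]]:
    """Map each repeated subject to its commit hashes, in log order.

    Hypothetical helper: expects `git log --oneline` style text, i.e. one
    "<abbrev-hash> <subject>" entry per line.
    """
    by_subject = defaultdict(list)
    for line in oneline_log.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at the first space: abbreviated hash, then the subject.
        commit, _, subject = line.partition(" ")
        by_subject[subject].append(commit)
    # Keep only subjects that were seen more than once.
    return {s: h for s, h in by_subject.items() if len(h) > 1}


# The log snippet from the report, without the "<---" annotations.
SAMPLE = """\
158a05a0b885 drm/amdgpu: Add use_xgmi_p2p module parameter
7294863a6f01 drm/amd: Check if ASPM is enabled from PCIe subsystem
b784f42cf78b drm/amdgpu: drop testing module parameter
0b1a63487b0f drm/amdgpu: drop benchmark module parameter
cec2cc7b1c4a drm/amdgpu: Fix typo in *whether* in comment
0ab5d711ec74 drm/amd: Refactor `amdgpu_aspm` to be evaluated per device
cba07cce39ac drm/amd: Check if ASPM is enabled from PCIe subsystem
dfcc3e8c24cc drm/amdgpu: make cyan skillfish support code more consistent
"""

if __name__ == "__main__":
    for subject, hashes in find_duplicate_subjects(SAMPLE).items():
        print(f"{subject}: {', '.join(hashes)}")
```

On a real tree, the input would presumably come from running
`git log --oneline drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c` and passing its
output to the helper.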