Re: [PATCH v2 0/2] PCI/ASPM: Enable ASPM and Clock PM by default on devicetree platforms

From: Val Packett
Date: Tue Nov 11 2025 - 02:40:37 EST



On 11/11/25 4:19 AM, Manivannan Sadhasivam wrote:
On Tue, Nov 11, 2025 at 03:51:03AM -0300, Val Packett wrote:
On 11/8/25 1:18 PM, Dmitry Baryshkov wrote:
On Mon, Sep 22, 2025 at 09:46:43PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
Hi,

This series is one of the 'let's bite the bullet' kind, where we have decided to
enable all ASPM and Clock PM states by default on devicetree platforms [1].
Devicetree platforms were chosen because the impact will be minimal compared
to ACPI platforms, so they seemed ideal for testing the waters.

This series is tested on Lenovo Thinkpad T14s based on Snapdragon X1 SoC. All
supported ASPM states are getting enabled for both the NVMe and WLAN devices by
default.
[..]
The series breaks the DRM CI on DB820C board (apq8096, PCIe network
card, NFS root). The board resets randomly after some time ([1]).
Is that reset.. due to the watchdog resetting a hard-frozen system?

Me and a bunch of other people in the #aarch64-laptops irc/matrix room have
been experiencing these random hard freezes with ASPM enabled for the NVMe
SSD, on Hamoa (and Purwa too I think) devices.

Interesting! ASPM has been tested and found to work on Hamoa and other Qcom
chipsets, except for Makena-based chipsets, which don't support L0s due to
incorrect PHY settings. APQ8096 might be an exception since it is a really old
target; I'm digging internally for information on its ASPM support.

Totally unpredictable, could be after 4 minutes or 4 days of uptime.
Panic-indicator LED not blinking, no reaction to magic SysRq, display image
frozen, just a complete hang until the watchdog does the reset.

I have a KIOXIA SSD on my T14s. I do see some random hangs, but I thought those
predate the ASPM enablement as I saw them earlier as well. But even before this
series, we had ASPM enabled for SSDs on Qcom targets (or devices that get
enumerated during the initial bus scan), so it might be that the SSD doesn't
support ASPM well enough.

I certainly remember that ASPM *was* enabled by default when I first got this laptop, via the custom way that predates this series.

Actually that custom enablement code getting removed was how I discovered it was ASPM related!

I pulled linux-next once and suddenly the system became stable!.. and then I noticed +2W of battery drain..

But I'm clueless as to why it results in a hang. What I know about ARM platforms
is that we get SError aborts and other crazy bus/NOC issues if the device doesn't
respond to a PCIe read request. So the hang could be due to one of those
issues.

Could the kernel be making requests before the device fully resumed from a sleep state?

I have confirmed with a modified (to accept args) enable-aspm.sh script[1]
that disabling ASPM *only* for the SSD, while keeping it *on* for the WiFi
adapter, is enough to keep the system stable (got to about a month of uptime
in that state).
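
For anyone wanting to repeat the per-device test, a minimal sketch follows. It
is not the enable-aspm.sh script referenced above; it assumes a kernel built
with CONFIG_PCIEASPM that exposes the per-device link/ attributes in sysfs, and
the function name and the device address in the usage note are hypothetical.

```python
from pathlib import Path

# Per-device ASPM attributes under <device>/link/ (present only when the
# kernel is built with CONFIG_PCIEASPM and the link supports the state).
ASPM_ATTRS = ("l0s_aspm", "l1_aspm", "l1_1_aspm", "l1_2_aspm")


def set_aspm(bdf, enable, sysfs_root="/sys/bus/pci/devices"):
    """Write 1/0 to every ASPM attribute found for `bdf`.

    Returns the list of attribute names that were written, so an empty
    list means the device exposes no ASPM controls (or does not exist).
    """
    link_dir = Path(sysfs_root) / bdf / "link"
    written = []
    for attr in ASPM_ATTRS:
        path = link_dir / attr
        if path.exists():
            path.write_text("1" if enable else "0")
            written.append(attr)
    return written
```

Run as root, something like set_aspm("0004:01:00.0", False) would turn ASPM
off for the SSD only and leave the WiFi adapter untouched. Take the real
address from lspci; the one here is made up.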

So this confirms that the controller supports it, and the device (SSD) might be
at fault here.

If you have reproduced the same issue on an entirely different SoC, it's
probably a general driver issue.

Please, please help us debug this using your internal secret debug equipment
:)

Starting from v6.18-rc3, we only enable L0s and L1 by default on all devicetree
platforms. Are you seeing the hangs post -rc3 also? If so, could you please
share the SSD model by doing 'lspci -nn'?

Yes, still seeing them on 6.18.0-rc4-next-20251107. At least with pcie_aspm=force (have been using that recently, so likely all my testing "post -rc3" was with force on.. but others have been testing without it)
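
Since results with and without pcie_aspm=force are getting mixed together, a
small helper for recording the boot configuration next to each report might be
useful. This is only a sketch: the function name is hypothetical, and it
assumes the policy file marks the active policy with square brackets, as
/sys/module/pcie_aspm/parameters/policy does.

```python
import re


def aspm_boot_config(cmdline, policy_text):
    """Return (value of pcie_aspm= on the command line, active ASPM policy).

    Either element is None when it cannot be found in the given text.
    """
    # Match "pcie_aspm=<value>" as a standalone parameter; this does not
    # match the separate "pcie_aspm.policy=" parameter.
    param = re.search(r"(?:^|\s)pcie_aspm=(\S+)", cmdline)
    # The active policy is the bracketed word, e.g. "[powersave]".
    policy = re.search(r"\[(\w+)\]", policy_text)
    return (
        param.group(1) if param else None,
        policy.group(1) if policy else None,
    )
```

Feed it the contents of /proc/cmdline and
/sys/module/pcie_aspm/parameters/policy and paste the result into the report.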

I'm currently using the stock drive: Sandisk Corp PC SN740 NVMe SSD (DRAM-less) [15b7:5015] (rev 01)

Though for a couple months I've used a 3rd party one, an SK Hynix BC901 [1c5c:1d59]

And other users have various other models and still have the same issue.
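
To collect the affected models from several reporters in one place, the
[vendor:device] pair can be pulled out of each 'lspci -nn' line. A rough
sketch, assuming the usual bracketed-ID output format; the function name is
hypothetical.

```python
import re

# Vendor/device IDs appear as "[vvvv:dddd]". The class code, e.g. "[0108]",
# has no colon and is therefore not matched.
ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]", re.IGNORECASE)


def pci_ids(lspci_line):
    """Return (vendor, device) as lowercase hex strings, or None if absent."""
    m = ID_RE.search(lspci_line)
    if m is None:
        return None
    return m.group(1).lower(), m.group(2).lower()
```

For the two drives above this yields ("15b7", "5015") and ("1c5c", "1d59").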

// Every time something PCIe related is posted to the mailing lists I've been wondering if it could solve this :D
"Program correct T_POWER_ON value for L1.2 exit timing" didn't help. Testing "Remove DPC Extended Capability" now..


~val