Re: [PATCH v2 0/2] PCI/ASPM: Enable ASPM and Clock PM by default on devicetree platforms

From: Manivannan Sadhasivam

Date: Tue Nov 11 2025 - 05:20:11 EST


On Tue, Nov 11, 2025 at 04:40:01AM -0300, Val Packett wrote:
>
> On 11/11/25 4:19 AM, Manivannan Sadhasivam wrote:
> > On Tue, Nov 11, 2025 at 03:51:03AM -0300, Val Packett wrote:
> > > On 11/8/25 1:18 PM, Dmitry Baryshkov wrote:
> > > > On Mon, Sep 22, 2025 at 09:46:43PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > > > > Hi,
> > > > >
> > > > > This series is one of the 'let's bite the bullet' kind, where we have decided to
> > > > > enable all ASPM and Clock PM states by default on devicetree platforms [1]. The
> > > > > > reason devicetree platforms were chosen is that the impact will be minimal
> > > > > > compared to the ACPI platforms. So they seemed ideal to test the waters.
> > > > >
> > > > > This series is tested on Lenovo Thinkpad T14s based on Snapdragon X1 SoC. All
> > > > > supported ASPM states are getting enabled for both the NVMe and WLAN devices by
> > > > > default.
> > > > > [..]
> > > > The series breaks the DRM CI on DB820C board (apq8096, PCIe network
> > > > card, NFS root). The board resets randomly after some time ([1]).
> > > Is that reset.. due to the watchdog resetting a hard-frozen system?
> > >
> > > Me and a bunch of other people in the #aarch64-laptops irc/matrix room have
> > > been experiencing these random hard freezes with ASPM enabled for the NVMe
> > > SSD, on Hamoa (and Purwa too I think) devices.
> > >
> > > Interesting! ASPM has been tested and found to work on Hamoa and other Qcom
> > > chipsets, except Makena-based chipsets, which don't support L0s due to
> > > incorrect PHY settings. APQ8096 might be an exception since it is a really old
> > > target, and I'm digging internally for details on its ASPM support.
> >
> > > Totally unpredictable, could be after 4 minutes or 4 days of uptime.
> > > Panic-indicator LED not blinking, no reaction to magic SysRq, display image
> > > frozen, just a complete hang until the watchdog does the reset.
> > >
> > > I have a KIOXIA SSD on my T14s. I do see some random hangs, but I thought those
> > > predate the ASPM enablement as I saw them earlier as well. But even before this
> > > series, we had ASPM enabled for SSDs on Qcom targets (or devices that get
> > > enumerated during the initial bus scan), so it might be that the SSD doesn't
> > > support ASPM well enough.
>
> I certainly remember that ASPM *was* enabled by default when I first got
> this laptop, via the custom way that predates this series.
>
> Actually that custom enablement code getting removed was how I discovered it
> was ASPM related!
>
> I pulled linux-next once and suddenly the system became stable!.. and then I
> noticed +2W of battery drain..
>

That is because we only enable L0s and L1 by default, not L1ss.

> > But I'm clueless on why it results in a hang. What I know on ARM platforms is
> > that we get SError aborts and other crazy bus/NOC issues if the device doesn't
> > respond to the PCIe read request. So the hang could be due to one of those
> > issues.
>
> Could the kernel be making requests before the device fully resumed from a
> sleep state?
>

The kernel has no visibility into the PCIe link ASPM states, since the
transitions happen autonomously in hardware once ASPM is enabled. So once the
kernel issues a PCIe read TLP, the link is supposed to transition to L0 and the
device should respond. But if the link doesn't come up for any reason, the read
ends in a completion timeout and weird things happen on the host.

> > > I have confirmed with a modified (to accept args) enable-aspm.sh script[1]
> > > that disabling ASPM *only* for the SSD, while keeping it *on* for the WiFi
> > > adapter, is enough to keep the system stable (got to about a month of uptime
> > > in that state).
> > >
> > > So this confirms that the controller supports it, and the device (SSD) might be
> > > at fault here.
> >
> > > If you have reproduced the same issue on an entirely different SoC, it's
> > > probably a general driver issue.
> > >
> > > Please, please help us debug this using your internal secret debug equipment
> > > :)
> > >
> > Starting from v6.18-rc3, we only enable L0s and L1 by default on all devicetree
> > platforms. Are you seeing the hangs post -rc3 also? If so, could you please
> > share the SSD model by doing 'lspci -nn'?
>
> Yes, still seeing them on 6.18.0-rc4-next-20251107. At least with
> pcie_aspm=force (have been using that recently, so likely all my testing
> "post -rc3" was with force on.. but others have been testing without it)
>

pcie_aspm=force will forcefully enable all the ASPM states. So it will result in
the same crash if L1ss is not supported properly by the endpoint.
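
To see which states actually ended up enabled for the SSD, the per-device
switches under the sysfs link/ directory can be dumped. This is only a minimal
sketch: aspm_report is a hypothetical helper name, the second argument exists
only so a different sysfs root can be passed, and it assumes a kernel built
with CONFIG_PCIEASPM so that the link/ attributes are present.

```shell
# Sketch: print the per-state ASPM switches the kernel exposes for one device.
# $1 is the device address (BDF), $2 optionally overrides the sysfs root.
# Attributes the kernel does not expose for this link are printed as "n/a".
aspm_report() {
    link_dir="${2:-/sys}/bus/pci/devices/$1/link"
    for state in l0s_aspm l1_aspm l1_1_aspm l1_2_aspm l1_1_pcipm l1_2_pcipm clkpm; do
        if [ -r "$link_dir/$state" ]; then
            printf '%s=%s\n' "$state" "$(cat "$link_dir/$state")"
        else
            printf '%s=n/a\n' "$state"
        fi
    done
}
```

Running 'aspm_report 0004:01:00.0' (the address is a placeholder, take the real
one from lspci) prints one name=value line per state, where 1 means enabled.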

> I'm currently using the stock drive: Sandisk Corp PC SN740 NVMe SSD
> (DRAM-less) [15b7:5015] (rev 01)
>
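
The pair in square brackets is the vendor:device ID that a PCI fixup matches
on. As a small sketch (pci_id_from_lspci_line is a hypothetical helper, not an
existing tool), it can be pulled out of an 'lspci -nn' line like this:

```shell
# Sketch: extract the [vendor:device] pair from a single `lspci -nn` line.
# The class code in brackets (four hex digits, no colon) is not matched.
pci_id_from_lspci_line() {
    printf '%s\n' "$1" |
        grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' |
        head -n1 |
        sed -E 's/^\[(.*)\]$/\1/'
}
```

Feeding it the line quoted above yields 15b7:5015, i.e. vendor 0x15b7 and
device 0x5015.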

I suspect an L1ss issue with this SSD, since you said above that next/master
works fine until you pass 'pcie_aspm=force'. Could you try the diff below with
that cmdline option?

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44e780718953..ba48f8184b68 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2525,6 +2525,16 @@ static void quirk_disable_aspm_l0s_l1(struct pci_dev *dev)
  */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASMEDIA, 0x1080, quirk_disable_aspm_l0s_l1);
 
+static void quirk_disable_aspm_l1ss(struct pci_dev *dev)
+{
+	pci_info(dev, "Disabling ASPM L1ss\n");
+	pci_disable_link_state(dev, PCIE_LINK_STATE_L1_1 |
+				    PCIE_LINK_STATE_L1_2 |
+				    PCIE_LINK_STATE_L1_1_PCIPM |
+				    PCIE_LINK_STATE_L1_2_PCIPM);
+}
+DECLARE_PCI_FIXUP_FINAL(0x15b7, 0x5015, quirk_disable_aspm_l1ss);
+
 /*
  * Remove ASPM L0s and L1 support from cached copy of Link Capabilities so
  * aspm.c won't try to enable them.
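
If rebuilding the kernel is inconvenient, roughly the same experiment can be
done at runtime through the sysfs link/ attributes. This is only a sketch and
not a replacement for the quirk: disable_l1ss is a hypothetical helper name,
it needs root, and unlike the fixup it only takes effect after boot, so L1ss
stays enabled until the moment it runs.

```shell
# Sketch: clear only the L1 substate switches for one device, leaving the
# L0s and L1 settings untouched.
# $1 is the device address (BDF), $2 optionally overrides the sysfs root.
disable_l1ss() {
    link_dir="${2:-/sys}/bus/pci/devices/$1/link"
    for state in l1_1_aspm l1_2_aspm l1_1_pcipm l1_2_pcipm; do
        # Skip attributes the kernel does not expose for this link.
        [ -w "$link_dir/$state" ] || continue
        echo 0 > "$link_dir/$state"
    done
}
```

For example, 'disable_l1ss 0004:01:00.0' with the SSD's real address in place
of the placeholder. If the hangs go away with only these four switches cleared,
that points at L1ss on the endpoint.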

> Though for a couple months I've used a 3rd party one, an SK Hynix BC901
> [1c5c:1d59]
>
> And other users have different other models and still have the same issue.
>
> // Every time something PCIe related is posted to the mailing lists I've
> been wondering if it could solve this :D
> "Program correct T_POWER_ON value for L1.2 exit timing" didn't help. Testing
> "Remove DPC Extended Capability" now..
>

You could've reported this issue to the linux-pci list.

- Mani

--
மணிவண்ணன் சதாசிவம்