Re: [External] : Re: [PATCH 2/2] PCI: Fix the PCIe bridge decreasing to Gen 1 during hotplug testing

From: ALOK TIWARI
Date: Tue Nov 25 2025 - 14:24:35 EST


Hi,

On 1/15/2025 3:48 PM, Lukas Wunner wrote:
> On Tue, Jan 14, 2025 at 08:25:04PM +0200, Ilpo Järvinen wrote:
>> On Tue, 14 Jan 2025, Jiwei wrote:
>>> [ 539.362400] ==== pcie_bwnotif_irq 269(stop running),link_status:0x7841
>>> [ 539.395720] ==== pcie_bwnotif_irq 247(start running),link_status:0x1041
>>
>> DLLLA=0
>>
>> But LBMS did not get reset.
>>
>> So is this perhaps because hotplug cannot keep up with the rapid
>> remove/add going on, and thus will not always call the remove_board()
>> even if the device went away?
>>
>> Lukas, do you know if there's a good way to resolve this within hotplug
>> side?
>
> I believe the pciehp code is fine and suspect this is an issue
> in the quirk. We've been dealing with rapid add/remove in pciehp
> for years without issues.
>
> I don't understand the quirk sufficiently to make a guess
> what's going wrong, but I'm wondering if there could be
> a race accessing the lbms_count?
>
> Maybe if lbms_count is replaced by a flag in pci_dev->priv_flags
> as we've discussed, with proper memory barriers where necessary,
> this problem will solve itself?
>
> Thanks,
>
> Lukas


We are testing hot-add/hot-remove behavior and observed the same issue as
mentioned above, where the PCIe bridge link speed drops from 32 GT/s to 2.5 GT/s.
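
For reference, the link_status values quoted earlier in the thread can be
decoded with a small userspace helper. This is only a sketch: the masks follow
the Link Status register layout (the same values as the PCI_EXP_LNKSTA_*
definitions in pci_regs.h), while the struct and function names below are
hypothetical and not kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Link Status register fields (same values as PCI_EXP_LNKSTA_* in pci_regs.h). */
#define LNKSTA_CLS	0x000f	/* Current Link Speed */
#define LNKSTA_NLW	0x03f0	/* Negotiated Link Width */
#define LNKSTA_LT	0x0800	/* Link Training */
#define LNKSTA_DLLLA	0x2000	/* Data Link Layer Link Active */
#define LNKSTA_LBMS	0x4000	/* Link Bandwidth Management Status */

/* Hypothetical container for the decoded fields. */
struct lnksta_fields {
	unsigned int speed;	/* raw Current Link Speed encoding */
	unsigned int width;	/* negotiated lane count */
	bool training;
	bool dllla;
	bool lbms;
};

/* Hypothetical helper: split a raw 16-bit Link Status value into fields. */
static struct lnksta_fields decode_lnksta(uint16_t val)
{
	struct lnksta_fields f;

	f.speed = val & LNKSTA_CLS;
	f.width = (val & LNKSTA_NLW) >> 4;
	f.training = (val & LNKSTA_LT) != 0;
	f.dllla = (val & LNKSTA_DLLLA) != 0;
	f.lbms = (val & LNKSTA_LBMS) != 0;
	return f;
}

/* Map the conventional Current Link Speed encoding to a readable name. */
static const char *lnksta_speed_name(unsigned int speed)
{
	switch (speed) {
	case 1: return "2.5 GT/s";
	case 2: return "5 GT/s";
	case 3: return "8 GT/s";
	case 4: return "16 GT/s";
	case 5: return "32 GT/s";
	case 6: return "64 GT/s";
	default: return "unknown";
	}
}
```

With these masks, 0x7841 decodes to 2.5 GT/s, x4, with LT, DLLLA and LBMS set,
and 0x1041 decodes to 2.5 GT/s, x4, with all three bits clear. This covers the
register bits only; as far as I understand, the lbms_count discussed above is
separate kernel-side state.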

My understanding is that pcie_failed_link_retrain() should only apply to
devices matched by PCI_VDEVICE(ASMEDIA, 0x2824), but the current
implementation appears to affect all devices that take longer to establish
a link. We are unsure whether this is intentional, but it effectively allows
such devices to continue operating at a reduced speed.

If we extend PCIE_LINK_RETRAIN_TIMEOUT_MS from 1000 ms to 3000 ms, these
slower devices are able to complete link training, and the problem is no
longer observed in our testing.
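
To illustrate why the timeout value matters, here is a minimal userspace
model of a poll-until-trained loop. It is not the kernel code: the sim_*
names are hypothetical, and the 2000 ms training time used in the note
below is an assumed figure for a slow device, not a measurement.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for a port whose link finishes training after
 * ready_after_ms milliseconds of simulated time.
 */
struct sim_port {
	unsigned int ready_after_ms;
};

/* Returns true while the simulated link is still training. */
static bool sim_link_training(const struct sim_port *port, unsigned int now_ms)
{
	return now_ms < port->ready_after_ms;
}

/*
 * Poll the simulated link every poll_ms until training completes or
 * timeout_ms has elapsed. Returns true if training completed in time.
 */
static bool sim_wait_for_link(const struct sim_port *port,
			      unsigned int timeout_ms, unsigned int poll_ms)
{
	unsigned int now_ms;

	for (now_ms = 0; now_ms <= timeout_ms; now_ms += poll_ms) {
		if (!sim_link_training(port, now_ms))
			return true;
	}
	return false;
}
```

In this model, a port with ready_after_ms = 2000 fails the wait with a
1000 ms timeout and passes it with a 3000 ms timeout, which matches the
behavior we see when changing PCIE_LINK_RETRAIN_TIMEOUT_MS.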

Would it be acceptable to increase PCIE_LINK_RETRAIN_TIMEOUT_MS from 1000 ms to 3000 ms in this case?


Thanks,
Alok