Re: [PATCH] PCI: Always lift 2.5GT/s restriction in PCIe failed link retraining

From: Matthew W Carlis

Date: Mon Feb 23 2026 - 17:50:06 EST

I wonder if the compromise here is to add a new kind of
DECLARE_PCI_FIXUP_<LINKFAIL?> and put this quirk behind it?

On Thu, 19 Feb 2026 16:53:22 -0600, Bjorn Helgaas wrote:
> I would like it much better if it's possible to limit it to
> devices with known defects.

On Fri, 20 Feb 2026 12:03:17 +0000, Maciej W. Rozycki wrote:
> As I say it's logically impossible to figure out whether or not to apply
> such a workaround where the culprit is the downstream device

I don't think we're looking at an impossible decision here in terms of
whether to apply the quirk. It makes the most sense in my mind to restrict
the quirk to the ASMedia device which was upstream of the Pericom switch in
the initial bug report, iirc.

1) There was never a root cause, so we can't say that this is in fact an issue
with the Pericom switch. It could be the ASMedia switch causing the problem.
2) Even if it is a bug in the Pericom, applying the quirk to only the ASMedia
switch limits the blast radius of the retrain action. It becomes unlikely that
we ever see any reports of issues from large scale users (hyperscalers, server
vendors etc).

On Mon, 23 Feb 2026 11:36:03 -0600, Bjorn Helgaas wrote:
> IIUC Matthew [1] and Alok [2] have reported issues that only happen
> when we run pcie_failed_link_retrain(). The issues seem to be with
> NVMe devices, but I don't see a root cause or a solution (other than
> skipping pcie_failed_link_retrain()).

I don't think the issue is really specific to NVMe devices. Realistically, what
is happening is that NVMe devices happen to be hot-plugged way more often than
any other kind of PCIe device. Vendors who design devices/systems for hot-plug
also do extensive hot-plug testing and, due to limitations of the kernel's
heuristics, end up invoking the quirk when it is not desirable to do so.

-Matt