Re: [PATCH v9 6/9] PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller

From: Niklas Schnelle
Date: Fri Dec 06 2024 - 18:07:02 EST


On Fri, 2024-12-06 at 21:07 +0100, Niklas Schnelle wrote:
> On Fri, 2024-12-06 at 20:31 +0100, Niklas Schnelle wrote:
> > On Fri, 2024-12-06 at 19:12 +0100, Niklas Schnelle wrote:
> > > On Fri, 2024-10-18 at 17:47 +0300, Ilpo Järvinen wrote:
> > > > This mostly reverts the commit b4c7d2076b4e ("PCI/LINK: Remove
> > > > bandwidth notification"). An upcoming commit extends this driver
> > > > building PCIe bandwidth controller on top of it.
> > > >
> > > > The PCIe bandwidth notification were first added in the commit
> > > > e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth
> > > > notification") but later had to be removed. The significant changes
> > > > compared with the old bandwidth notification driver include:
> > > >
> > ---8<---
> > > > ---
> > >
> > > Hi Ilpo,
> > >
> > > I bisected a v6.13-rc1 boot hang on my personal workstation to this
> > > patch. Sadly I don't have much details like a panic or so because the
> > > boot hangs before any kernel messages, or at least they're not visible
> > > long enough to see. I haven't yet looked into the code as I wanted to
> > > raise awareness first. Since the commit doesn't revert cleanly on
> > > v6.13-rc1 I also haven't tried that yet.
> > >
> > > Here are some details on my system:
> > > - AMD Ryzen 9 3900X
> > > - ASRock X570 Creator Motherboard
> > > - Radeon RX 5600 XT
> > > - Intel JHL7540 Thunderbolt 3 USB Controller (only USB 2 plugged)
> > > - Intel 82599 10 Gigabit NIC with SR-IOV enabled with 2 VFs
> > > - Intel n I211 Gigabit NIC
> > > - Intel Wi-Fi 6 AX200
> > > - Aquantia AQtion AQC107 NIC
> > >
> > > If you have patches or things to try just ask.
> > >
> > > Thanks,
> > > Niklas
> > >
> >
> > Ok I can now at least confirm that bluntly disabling the new bwctrl
> > driver with the below diff on top of v6.13-rc1 circumvents the boot
> > hang I'm seeing. So it's definitely this.
> >
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index 5e10306b6308..6fa54480444a 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -828,7 +828,7 @@ static void __init pcie_init_services(void)
> > pcie_aer_init();
> > pcie_pme_init();
> > pcie_dpc_init();
> > - pcie_bwctrl_init();
> > + /* pcie_bwctrl_init(); */
> > pcie_hp_init();
> > }
> >
>
> Also here is the full lspci -vvv output running the above on v6.13-rc1:
> https://paste.js.org/9UwQIMp7eSgp
>
> Also note that I have CONFIG_PCIE_THERMAL unset so it's also not the
> cooling device portion that's causing the issue. Next I guess I should
> narrow it down to the specific port where enabling the bandwidth
> monitoring is causing trouble, not yet sure how best to do this with
> this many devices.
>
> Thanks,
> Niklas

Ok did some fiddeling and it's the thunderbolt ports. The below diff
works around the issue. That said I guess for a proper fix this would
should get filtered by the port service matching? Also as can be seen
in lspci the port still claims to support bandwidth management so maybe
other thunderbolt ports actually do.

Thanks,
Niklas

diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index b59cacc740fa..76a14f959c7f 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -294,6 +294,9 @@ static int pcie_bwnotif_probe(struct pcie_device *srv)
struct pci_dev *port = srv->port;
int ret;

+ if (srv->port->is_thunderbolt)
+ return -EOPNOTSUPP;
+
struct pcie_bwctrl_data *data = devm_kzalloc(&srv->device,
sizeof(*data), GFP_KERNEL);
if (!data)