Re: [PATCH 1/2] PCI: Correct error reporting with PCIe failed link retraining
From: Maciej W. Rozycki
Date: Fri Aug 09 2024 - 09:32:05 EST
On Wed, 24 Apr 2024, Bjorn Helgaas wrote:
> > linux-pcie-failed-link-retrain-status-fix.diff
> > Index: linux-macro/drivers/pci/quirks.c
> > ===================================================================
> > --- linux-macro.orig/drivers/pci/quirks.c
> > +++ linux-macro/drivers/pci/quirks.c
> > @@ -74,7 +74,8 @@
> > * firmware may have already arranged and lift it with ports that already
> > * report their data link being up.
> > *
> > - * Return TRUE if the link has been successfully retrained, otherwise FALSE.
> > + * Return TRUE if the link has been successfully retrained, otherwise FALSE,
> > + * also when retraining was not needed in the first place.
>
> Can you recast this? I think it's slightly unclear what is returned
> when retraining is not needed. I *think* you return FALSE when
> retraining is not needed. Maybe this?
>
> Return TRUE if the link has been successfully retrained. Return
> FALSE if retraining was not needed or we attempted a retrain and it
> failed.
Sure, thanks for the suggestion. Applied verbatim except for formatting.
> > @@ -83,10 +84,11 @@ bool pcie_failed_link_retrain(struct pci
> > {}
> > };
> > u16 lnksta, lnkctl2;
> > + bool ret = false;
> >
> > if (!pci_is_pcie(dev) || !pcie_downstream_port(dev) ||
> > !pcie_cap_has_lnkctl2(dev) || !dev->link_active_reporting)
> > - return false;
> > + return ret;
>
> We know the value here, so IMO it's easier to read if we return
> "false" instead of "ret".
Well, either patch in the series has to make this change. If you prefer
it to be the second one, then I'm fine with it. Applied throughout then.
> > @@ -117,13 +120,14 @@ bool pcie_failed_link_retrain(struct pci
> > lnkctl2 |= lnkcap & PCI_EXP_LNKCAP_SLS;
> > pcie_capability_write_word(dev, PCI_EXP_LNKCTL2, lnkctl2);
> >
> > - if (pcie_retrain_link(dev, false)) {
> > + ret = pcie_retrain_link(dev, false) == 0;
> > + if (!ret) {
> > pci_info(dev, "retraining failed\n");
> > - return false;
> > + return ret;
> > }
> > }
> >
> > - return true;
> > + return ret;
>
> It gets awfully subtle by the time we get here. I guess we could set
> a "retrain_attempted" flag above and do this:
>
> if (retrain_attempted)
> return true;
>
> return false;
>
> But I dunno if it's any better. I understand the need for a change
> like this, but the whole idea of returning failure (false) for a
> retrain failure and also for a "no retrain needed" is a little
> mind-bending.
I agree it's a bit subtle, but not incorrect and to go away with the next
change, which I want to keep separate from this one for clarity. Let's
not overdesign for this transitional state then please.
I've posted v2 now with extra patches to address further issues discussed
recently:
<https://patchwork.kernel.org/project/linux-pci/list/?series=878216>.
Thank you for your review!
Maciej