Re: [PATCH V2 4/9] PCI/AER: Extend AER error handling to RCECs

From: Bjorn Helgaas
Date: Mon Aug 17 2020 - 18:24:39 EST


On Mon, Aug 10, 2020 at 10:32:52AM +0100, Jonathan Cameron wrote:
> On Fri, 7 Aug 2020 17:55:17 -0700
> Sean V Kelley <sean.v.kelley@xxxxxxxxx> wrote:
> > On 7 Aug 2020, at 15:53, Bjorn Helgaas wrote:
> > > On Tue, Aug 04, 2020 at 12:40:47PM -0700, Sean V Kelley wrote:

> > >> if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
> > >> - pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
> > >> + pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
> > >> + pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END ||
> > >> + pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC))
> > >> dev = dev->bus->self;

I'm not sure I understand this "if" statement. Previously (with no
RCEC support), the possible ways I see to call pcie_do_recovery() are
with:

AER native: Root Port
AER via APEI: Root Port or other PCIe device (ACPI v6.3, 18.3.2.5)
DPC: Root Port or Switch Downstream Port
EDR: Root Port or Switch Downstream Port

I *guess* the reason we have this "if" statement is for the AER/APEI
case? And the effect is that even if AER/APEI gives us an Endpoint,
we back up and handle it as though we got it from the Downstream Port
above it, i.e., we reset the Endpoint along with any other children of
that Downstream Port?

Then, IIUC, your patches add this case:

AER native: Root Port or RCEC
AER via APEI: Root Port, RCEC, or other PCIe device

Just noodling here, but I wonder if this would be more understandable
as something like:

type = pci_pcie_type(dev);
if (type == PCI_EXP_TYPE_ROOT_PORT ||
type == PCI_EXP_TYPE_DOWNSTREAM ||
type == PCI_EXP_TYPE_RC_EC)
bridge = dev;
else if (type == PCI_EXP_TYPE_RC_END)
bridge = dev->rcec;
else
bridge = pci_upstream_bridge(dev);

and then we could do:

if (type == PCI_EXP_TYPE_RC_END)
flr_on_rciep(dev);
else
reset_link(bridge);

It's still awkward to have to deal with being supplied either
endpoints or bridges. But I guess in the AER/APEI case, we aren't
allowed to touch the error registers so maybe we can't avoid the
awkwardness.

> > >> status = reset_link(dev);
> > >
> > > reset_link() might be misnamed. IIUC "dev" is a bridge, and the point
> > > is really to reset any devices below "dev." Whether we do that by
> > > resetting link, DPC trigger, secondary bus reset, FLR, etc, is sort of
> > > immaterial. Some of those methods might be applicable for RCiEPs.
> > >
> > > But you didn't add that name; I'm just trying to understand this
> > > better.
> >
> > Yes, that’s a confusing term with the _link attached. It’s difficult
> > to relate to the different resets that might be applicable. I was
> > thinking about that when looking at the callback path via the
> > “reset_link” of the RCiEP to the RCEC for the sole purpose of
> > clearing the Root Port Error Status. It would be worth time to spend
> > looking at better descriptive naming/methods.
>
> Agreed, this caused me some some confusion as well so more descriptive
> naming would be good.

Maybe something like reset_subordinate_devices()? Then it's clear
that we pass a bridge and reset the devices *below* it. It's not
quite as obvious for RCECs, since they aren't bridges and the RCiEPs
aren't actually *subordinates*, but maybe it's still suggestive of the
logical relationship?

Bjorn