Re: [PATCH] PCI/portdrv: Enable error reporting on managed ports

From: Bjorn Helgaas
Date: Tue Oct 09 2018 - 13:56:37 EST


On Tue, Sep 04, 2018 at 12:33:09PM -0600, Jon Derrick wrote:
> During probe, the port driver will disable error reporting and assumes
> it will be enabled later by the AER driver's pci_walk_bus() sequence.
> This may not be the case for host-bridge enabled root ports, who will
> enable first error reporting on the bus during the root port probe, and
> then disable error reporting on downstream devices during subsequent
> probing of the bus.

I understand the hotplug case (see below), but help me understand this
"host-bridge enabled root ports" thing. I'm not sure what that means.

We run pcie_portdrv_probe() for every root port, switch upstream port,
and switch downstream port, and it always disables error reporting for
the port:

    pcie_portdrv_probe                  # pci_driver .probe
      pcie_port_device_register
        get_port_device_capability
          services |= PCIE_PORT_SERVICE_AER
          pci_disable_pcie_error_reporting
            # clear DEVCTL Error Reporting Enables

For root ports, we call aer_probe(), and it enables error reporting
for the entire tree below the root port:

    aer_probe                           # pcie_port_service .probe
      aer_enable_rootport
        set_downstream_devices_error_reporting(dev, true)
          pci_walk_bus(dev->subordinate, set_device_error_reporting)
            set_device_error_reporting
              if (Root Port || Upstream Port || Downstream Port)
                pci_enable_pcie_error_reporting
                  # set DEVCTL Error Reporting Enables
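
For reference, the "DEVCTL Error Reporting Enables" in both call trees are
the four low bits of the PCIe Device Control register. Below is a minimal
user-space sketch of the set/clear operations. The bit values are the ones
in include/uapi/linux/pci_regs.h; TOY_ERR_ENABLES and the two toy_devctl_*()
helpers are made-up names that operate on a plain uint16_t, not on config
space, so this is an illustration and not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Device Control register bits, values as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_DEVCTL_CERE	0x0001	/* Correctable Error Reporting Enable */
#define PCI_EXP_DEVCTL_NFERE	0x0002	/* Non-Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_FERE	0x0004	/* Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_URRE	0x0008	/* Unsupported Request Reporting Enable */

/* Hypothetical name for the combined mask of all four enables */
#define TOY_ERR_ENABLES (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
			 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)

static_assert(TOY_ERR_ENABLES == 0x000f, "enables occupy DEVCTL bits 3:0");

/* Illustrative stand-in for "set DEVCTL Error Reporting Enables" */
static uint16_t toy_devctl_enable_err(uint16_t devctl)
{
	return (uint16_t)(devctl | TOY_ERR_ENABLES);
}

/* Illustrative stand-in for "clear DEVCTL Error Reporting Enables" */
static uint16_t toy_devctl_disable_err(uint16_t devctl)
{
	return (uint16_t)(devctl & (uint16_t)~TOY_ERR_ENABLES);
}
```

Both helpers leave the other DEVCTL bits untouched, so the only state the
two code paths fight over is those four enable bits.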

This is definitely broken for hot-added switches because aer_probe()
is the only place we enable error reporting, and it's only run when we
enumerate a root port, not when we hot-add things below that root
port.
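
To make the ordering problem concrete, here is a small user-space toy model
of the two paths above. Everything in it (struct toy_dev, the toy_*()
functions, the "present" flag) is hypothetical and only mimics the behavior
described in this mail: the port probe always clears the enables, and the
AER probe sets them on the root port and on every port that exists below it
at that moment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_type { ROOT_PORT, UPSTREAM_PORT, DOWNSTREAM_PORT, ENDPOINT };

struct toy_dev {
	enum toy_type type;
	int parent;		/* index of the parent device, -1 for a root port */
	bool present;		/* false until enumerated or hot-added */
	bool err_enabled;	/* stands in for the DEVCTL enables */
};

/* Models pcie_portdrv_probe(): always clears the enables on the port. */
static void toy_portdrv_probe(struct toy_dev *devs, int i)
{
	assert(devs[i].type != ENDPOINT);
	devs[i].err_enabled = false;
}

/* True if device i sits anywhere below device root. */
static bool toy_is_below(const struct toy_dev *devs, int i, int root)
{
	for (int p = devs[i].parent; p != -1; p = devs[p].parent)
		if (p == root)
			return true;
	return false;
}

/*
 * Models aer_probe(): enable reporting on the root port and on every
 * port that is present below it right now. Devices added later are
 * never visited.
 */
static void toy_aer_probe(struct toy_dev *devs, size_t n, int root)
{
	assert(devs[root].type == ROOT_PORT);
	devs[root].err_enabled = true;
	for (size_t i = 0; i < n; i++)
		if (devs[i].present && devs[i].type != ENDPOINT &&
		    toy_is_below(devs, (int)i, root))
			devs[i].err_enabled = true;
}
```

In this model, running toy_portdrv_probe() on a port after toy_aer_probe()
has already walked the tree leaves that port disabled for good, which is
the hot-add case and also what the log in the changelog shows for the
ports probed after the root port.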

> A hotplugged port device may also fail to enable error reporting as the
> AER driver has already run on the root bus.

> Check for these conditions and enable error reporting during portdrv
> probing.
>
> Example case:

pcie_portdrv_probe(10000:00:00.0):
> [ 343.790573] pcieport 10000:00:00.0: pci_disable_pcie_error_reporting

aer_probe(10000:00:00.0):
> [ 343.809812] pcieport 10000:00:00.0: pci_enable_pcie_error_reporting
> [ 343.819506] pci 10000:01:00.0: pci_enable_pcie_error_reporting
> [ 343.828814] pci 10000:02:00.0: pci_enable_pcie_error_reporting
> [ 343.838089] pci 10000:02:01.0: pci_enable_pcie_error_reporting
> [ 343.847478] pci 10000:02:02.0: pci_enable_pcie_error_reporting
> [ 343.856659] pci 10000:02:03.0: pci_enable_pcie_error_reporting
> [ 343.865794] pci 10000:02:04.0: pci_enable_pcie_error_reporting
> [ 343.874875] pci 10000:02:05.0: pci_enable_pcie_error_reporting
> [ 343.883918] pci 10000:02:06.0: pci_enable_pcie_error_reporting
> [ 343.892922] pci 10000:02:07.0: pci_enable_pcie_error_reporting

pcie_portdrv_probe(10000:01:00.0):
> [ 343.918900] pcieport 10000:01:00.0: pci_disable_pcie_error_reporting

pcie_portdrv_probe(10000:02:00.0):
> [ 343.968426] pcieport 10000:02:00.0: pci_disable_pcie_error_reporting

...
> [ 344.028179] pcieport 10000:02:01.0: pci_disable_pcie_error_reporting
> [ 344.091269] pcieport 10000:02:02.0: pci_disable_pcie_error_reporting
> [ 344.156473] pcieport 10000:02:03.0: pci_disable_pcie_error_reporting
> [ 344.238042] pcieport 10000:02:04.0: pci_disable_pcie_error_reporting
> [ 344.321864] pcieport 10000:02:05.0: pci_disable_pcie_error_reporting
> [ 344.411601] pcieport 10000:02:06.0: pci_disable_pcie_error_reporting
> [ 344.505332] pcieport 10000:02:07.0: pci_disable_pcie_error_reporting

> [ 344.621824] nvme 10000:06:00.0: pci_enable_pcie_error_reporting
>
> Signed-off-by: Jon Derrick <jonathan.derrick@xxxxxxxxx>
> ---
> drivers/pci/pcie/portdrv_core.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 7c37d81..fdd953a 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -343,6 +343,16 @@ int pcie_port_device_register(struct pci_dev *dev)
> 	if (!nr_service)
> 		goto error_cleanup_irqs;
>
> +#ifdef CONFIG_PCIEAER
> +	/*
> +	 * Enable error reporting for this port in case AER probing has already
> +	 * run on the root bus or this port device is hot-inserted
> +	 */
> +	if (dev->aer_cap && pci_aer_available() &&
> +	    (pcie_ports_native || pci_find_host_bridge(dev->bus)->native_aer))
> +		pci_enable_pcie_error_reporting(dev);
> +#endif

I plan to apply this after we clarify the changelog a bit, but I don't
really like this patch because it, like the corresponding code added by
2bd50dd800b5 ("PCI: PCIe: Disable PCIe port services during port
initialization"), seems a little out of place.

The way I think this *should* work is that the PCI core should arrange to
handle AER interrupts when it enumerates the devices that can generate
them (Root Ports and Root Complex Event Collectors), even before it
enumerates the devices below the Root Port.

Then the PCI core could directly enable the AER interrupts on all devices
as it enumerates them. I would envision both cases being handled somewhere
like pci_aer_init() in pci_init_capabilities().
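
A rough sketch of that idea, again as a user-space toy and not kernel code:
the enable happens in a per-device init hook that runs for each device as
it is enumerated, so ordering relative to any later probe no longer
matters. struct toy_dev, toy_aer_init() and toy_enumerate() are
hypothetical names, and the has_aer_cap/native_aer conditions are
assumptions loosely borrowed from the check in the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dev {
	bool has_aer_cap;	/* device exposes an AER capability */
	bool err_enabled;	/* stands in for the DEVCTL enables */
};

/*
 * Hypothetical per-device hook, in the spirit of doing the work from
 * something like pci_aer_init(): enable reporting as soon as the device
 * is enumerated, provided the OS owns AER (assumed condition).
 */
static void toy_aer_init(struct toy_dev *dev, bool native_aer)
{
	if (dev->has_aer_cap && native_aer)
		dev->err_enabled = true;
}

/*
 * Hypothetical enumeration loop: every device, whether found at boot or
 * hot-added later, passes through the same hook. Returns how many
 * devices ended up with reporting enabled.
 */
static size_t toy_enumerate(struct toy_dev *devs, size_t n, bool native_aer)
{
	size_t enabled = 0;

	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		toy_aer_init(&devs[i], native_aer);
		if (devs[i].err_enabled)
			enabled++;
	}
	return enabled;
}
```

With the enable tied to enumeration, a hot-added switch would take the same
path as one found at boot, and no driver would need its own enable call.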

This would also allow us to get rid of the pci_enable_pcie_error_reporting()
calls that are currently sprinkled around in drivers, because that would be
handled by the core for all devices.

Bjorn