sb_edac.c lacks PCI domain support?

From: Bjorn Helgaas
Date: Wed Aug 08 2018 - 16:07:52 EST


I think sb_edac.c (and probably other EDAC stuff) lacks PCI domain
support. I notice messages like this:

[ 14.370256] pci 0000:ff:13.5: [8086:6fad] type 00 class 0x088000
[ 14.980481] pci 0000:bf:13.5: [8086:6fad] type 00 class 0x088000
[ 15.590646] pci 0000:7f:13.5: [8086:6fad] type 00 class 0x088000
[ 16.200498] pci 0000:3f:13.5: [8086:6fad] type 00 class 0x088000
[ 17.928243] pci 0001:ff:13.5: [8086:6fad] type 00 class 0x088000
[ 18.538876] pci 0001:bf:13.5: [8086:6fad] type 00 class 0x088000
[ 19.149211] pci 0001:7f:13.5: [8086:6fad] type 00 class 0x088000
[ 19.759431] pci 0001:3f:13.5: [8086:6fad] type 00 class 0x088000
...
[ 54.298058] EDAC sbridge: Duplicated device for 8086:6fad
[ 54.298062] EDAC sbridge: Failed to register device with error -19.

on a large system (see [1]). sbridge_get_onedevice() appears to look
things up by PCI bus number alone, ignoring the PCI domain (aka
segment) number, so I suspect it treats 0000:ff:13.5 and 0001:ff:13.5
as duplicates.

  sbridge_get_all_devices
    while (...)
      do
        sbridge_get_onedevice
          pdev = pci_get_device(...)
          sbridge_dev = get_sbridge_dev(pdev->bus->number, ...)
          if (sbridge_dev->pdev[sbridge_dev->i_devs])
            printk("Duplicated device ...")
            return -ENODEV   # -19
      while (pdev ...)
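
To illustrate the suspected failure mode in the trace above, here is a
minimal userspace sketch, not the driver's code. struct model_dev,
find_by_bus(), find_by_domain_bus() and model_register() are all
hypothetical names invented for this example. In the kernel, the
domain would presumably come from pci_domain_nr(pdev->bus).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's per-bus record (not struct sbridge_dev). */
struct model_dev {
	uint16_t domain;	/* PCI domain (segment) number */
	uint8_t bus;		/* PCI bus number */
	int in_use;		/* nonzero once a device has claimed this slot */
};

/* Bus-only match: the kind of lookup the trace above is suspected of doing. */
static struct model_dev *find_by_bus(struct model_dev *tbl, size_t n,
				     uint8_t bus)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].in_use && tbl[i].bus == bus)
			return &tbl[i];
	return NULL;
}

/* Domain-aware match: 0000:ff and 0001:ff are distinct keys. */
static struct model_dev *find_by_domain_bus(struct model_dev *tbl, size_t n,
					    uint16_t domain, uint8_t bus)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].in_use && tbl[i].domain == domain &&
		    tbl[i].bus == bus)
			return &tbl[i];
	return NULL;
}

/*
 * Record one device. Returns 0 on success, -1 if the chosen lookup
 * reports a duplicate, -2 if the table is full.
 */
int model_register(struct model_dev *tbl, size_t n, uint16_t domain,
		   uint8_t bus, int use_domain)
{
	struct model_dev *hit;

	assert(tbl != NULL);
	hit = use_domain ? find_by_domain_bus(tbl, n, domain, bus)
			 : find_by_bus(tbl, n, bus);
	if (hit)
		return -1;

	for (size_t i = 0; i < n; i++) {
		if (!tbl[i].in_use) {
			tbl[i] = (struct model_dev){ .domain = domain,
						     .bus = bus,
						     .in_use = 1 };
			return 0;
		}
	}
	return -2;
}
```

With use_domain set to 0, registering bus ff in domain 0001 after bus
ff in domain 0000 returns -1, which mirrors the "Duplicated device"
report. With use_domain set to 1, both registrations succeed.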

It looks like 88ae80aa609c ("EDAC, skx_edac: Handle systems with
segmented PCI busses") fixes a similar problem; maybe that should
be applied elsewhere in EDAC as well?

Why doesn't EDAC use the standard pci_register_driver() interface?
That would avoid issues like this. It would also avoid the potential
conflict of another driver operating on the device at the same time.

[1] https://bugzilla.kernel.org/attachment.cgi?id=277759 (attachment
to unrelated bug https://bugzilla.kernel.org/show_bug.cgi?id=200765)