Re: Peer bridge fixup issue under multiple pci domain

From: Bjorn Helgaas
Date: Tue Aug 28 2018 - 15:35:24 EST

[+cc EDAC folks, LKML]

On Sat, Aug 25, 2018 at 10:58:57PM +0800, Zihan Yang wrote:
> Hi all,
> I'm trying to use multiple pci domain in qemu q35, but I find there
> might be some issues in peer bridge fixup.
> In short, pcibios_fixup_peer_bridges function assumes only one pci
> domain (0) by default. This is OK, as qemu by default uses only
> one pci domain too. However, if I add another host bridge which is
> put into pci domain 1 by using _SEG, and a pcie_pci_bridge is attached
> to the bus 1 under this new pci domain 1 rather than domain 0, the
> kernel will recognize the bus 01 differently.
> More specifically, pcibios_fixup_peer_bridges only reads all the buses
> under domain 0 but it can read the pci bus 01 in pci domain 1 and treat
> it as a peer bus of 0000:00. The consequence is this 01 bus is recognized
> as 0000:01, but it should have been recognized as 0001:01.
> The host bus 0001:00 can be recognized so I guess pcibios_fixup_peer_bridges
> needs updating to take care of multiple domains? Or is it just a BIOS issue?
> I'm not quite sure and I'm open to any suggestions.

Is there something that actually does not work, or is this just a
concern that the code looks wrong?

pcibios_fixup_peer_bridges() is ancient history from before x86 used
the ACPI namespace to discover host bridges. It blindly probes for
devices on buses 0-255, but as you say, only in domain 0.
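
To make the "blind probe" concrete, here is a rough user-space model of
that loop. It is only a sketch, not the kernel code: fake_dev,
fake_read_vendor(), legacy_probe_bus() and legacy_fixup_peer_bridges()
are hypothetical names, and the device table is made up for
illustration. The only point is that the domain argument is hard-wired
to 0, so nothing in another domain can ever answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one PCI function visible in config space. */
struct fake_dev {
	uint16_t domain;
	uint8_t bus;
	uint8_t devfn;
	uint16_t vendor;
};

/* Made-up topology: bus ff in domain 0 is not described by firmware,
 * and one device lives on bus 01 in domain 1. */
static const struct fake_dev fake_devs[] = {
	{ 0, 0x00, 0x00, 0x8086 },
	{ 0, 0xff, 0x18, 0x8086 },	/* ff:03.0 */
	{ 1, 0x01, 0x00, 0x1b36 },
};

/* Stand-in for a config read of the vendor ID; 0xffff means "no device". */
static uint16_t fake_read_vendor(uint16_t domain, uint8_t bus, uint8_t devfn)
{
	size_t i;

	for (i = 0; i < sizeof(fake_devs) / sizeof(fake_devs[0]); i++) {
		if (fake_devs[i].domain == domain &&
		    fake_devs[i].bus == bus &&
		    fake_devs[i].devfn == devfn)
			return fake_devs[i].vendor;
	}
	return 0xffff;
}

/* Probe function 0 of every device on one bus, always in domain 0. */
static int legacy_probe_bus(uint8_t bus)
{
	unsigned int devfn;

	for (devfn = 0; devfn < 256; devfn += 8) {
		uint16_t v = fake_read_vendor(0, bus, (uint8_t)devfn);

		if (v != 0x0000 && v != 0xffff)
			return 1;
	}
	return 0;
}

/* Walk buses 0-255, skip buses already known (e.g. from ACPI), and mark
 * every other bus where something responds.  Returns the number of
 * "peer buses" discovered this way. */
int legacy_fixup_peer_bridges(const uint8_t known[256], uint8_t found[256])
{
	unsigned int bus;
	int count = 0;

	assert(known != NULL && found != NULL);

	for (bus = 0; bus < 256; bus++) {
		found[bus] = 0;
		if (known[bus])
			continue;
		if (legacy_probe_bus((uint8_t)bus)) {
			found[bus] = 1;
			count++;
		}
	}
	return count;
}
```

With only bus 00 marked as known, this model reports bus ff as a peer
bus and never sees the device in domain 1.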

Using multiple PCI domains really requires ACPI support so we know
what the other domains are (_SEG) and how to access their config space
(MCFG). When we do have ACPI support in the platform and the kernel,
drivers/acpi/pci_root.c discovers all the host bridges in all domains
via PNP0A03 or PNP0A08 devices in the ACPI namespace, and in most
cases pcibios_fixup_peer_bridges() will do nothing.

However, there *are* systems where the firmware does not expose all
host bridges and in those cases, pcibios_fixup_peer_bridges() can be a
problem. For example, Intel processors often have management devices
on bus 7f or ff. If the ACPI namespace doesn't have a host bridge to
those buses, pci_root.c won't find them, but
pcibios_fixup_peer_bridges() *will*.

This leads to several problems. Here's a dmesg sample from [1]
(found by googling for 'dmesg log "PCI: discovered peer bus ff"'):

ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
PCI: Discovered peer bus fe
pci_bus 0000:fe: root bus resource [io 0x0000-0xffff]
pci_bus 0000:fe: root bus resource [mem 0x00000000-0xffffffffff]
pci 0000:fe:03.0: [8086:2d98] type 00 class 0x060000
PCI: Discovered peer bus ff
pci_bus 0000:ff: root bus resource [io 0x0000-0xffff]
pci_bus 0000:ff: root bus resource [mem 0x00000000-0xffffffffff]
pci 0000:ff:03.0: [8086:2d98] type 00 class 0x060000
EDAC MC1: Giving out device to module i7core_edac.c controller i7 core #1: DEV 0000:fe:03.0 (INTERRUPT)
EDAC PCI0: Giving out device to module i7core_edac controller EDAC PCI controller: DEV 0000:fe:03.0 (POLLED)
EDAC MC0: Giving out device to module i7core_edac.c controller i7 core #0: DEV 0000:ff:03.0 (INTERRUPT)
EDAC PCI1: Giving out device to module i7core_edac controller EDAC PCI controller: DEV 0000:ff:03.0 (POLLED)

Some of the problems are:

- Firmware may have omitted the host bridges to [bus fe] and
[bus ff] from the ACPI namespace because *it* is using those
management devices, so EDAC blindly using them is a potential
conflict.

- pcibios_fixup_peer_bridges() only scans domain 0, so if this
system had multiple domains, EDAC would only work on things in
domain 0, ignoring other domains.

- The PCI core can't do bus number assignment correctly for devices
behind bridge PCI0. The firmware told us [bus 00-ff] was
available, so the core may assign bus number fe to some deep
switch hierarchy. But bus fe conflicts with the devices on the
"peer bus fe". This part is a firmware bug: it should have told
us that PCI0 leads to [bus 00-fd], not [bus 00-ff].

- The PCI core can't do resource assignment correctly for devices on
[bus fe] and [bus ff]. It has no information about which MMIO and
I/O port ranges are routed to those buses, so it assumes *all* memory and
I/O ports are routed there, which is clearly incorrect. This part
is a Linux bug; we really shouldn't be poking around for buses
that ACPI didn't tell us about.
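
The bus number conflict in the third item can be shown with a trivial
range check. Again this is just a sketch under my own assumptions:
struct bus_range and the two helpers below are hypothetical and not
kernel interfaces. The ranges are the ones from the log above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a host bridge's bus number range. */
struct bus_range {
	uint8_t start;
	uint8_t end;	/* inclusive */
};

/* Return 1 if the bus number falls inside the range. */
int bus_range_contains(struct bus_range r, uint8_t bus)
{
	return bus >= r.start && bus <= r.end;
}

/* Count the peer buses that collide with the range the firmware
 * declared for a host bridge. */
int count_peer_conflicts(struct bus_range root, const uint8_t *peers, int n)
{
	int i, conflicts = 0;

	assert(peers != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (bus_range_contains(root, peers[i]))
			conflicts++;
	}
	return conflicts;
}
```

With PCI0 declared as [bus 00-ff], both peer buses fe and ff fall inside
the range the core is free to assign from. With [bus 00-fd], neither
does.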