[PATCH 1/2] PCI: fix serious bug when sizing bridges with additional size

From: Nicholas Johnson
Date: Wed Feb 06 2019 - 05:25:29 EST


Background: I discovered a bug while preparing my next patch for
Thunderbolt native PCI enumeration. This bugfix is vital for that next
Thunderbolt patch. They are related and could be put together but I am
presenting them as separate patches.

Nature of problem: On boot, the PCI bridges are handled in the function
pci_assign_unassigned_root_bus_resources(). This function makes multiple
passes / attempts to assign all of the resources. It calls
__pci_bus_size_bridges() which tries to assign required resources, plus
additional requested space. It makes calls to pbus_size_io() and
pbus_size_mem() which return zero on success. However, these functions
misleadingly return non-zero if the resource is already assigned.

The fallout: If the first attempt to allocate resources on a bridge
succeeds in 64-bit MMIO_PREF, but fails in 32-bit MMIO, then a second
attempt is made. Then, the __pbus_size_mem() returns non-zero for
MMIO_PREF because the resource is already assigned. That makes
__pci_bus_size_bridges() think there is no space (ENOSPC) and it
attempts to assign the sum of the MMIO_PREF and MMIO sizes into a 32-bit
window. This means that it tries to request the sum of the the MMIO and
MMIO_PREF size into the MMIO window. In the best case, the MMIO size is
equal to the MMIO_PREF size and we get double the window in MMIO, and
the correct size in MMIO_PREF, and nothing bad happens. In a worse case,
doubling the size makes it fail due to lack of space in the miniscule
MMIO 32-bit address space. In that case, we might be able to reduce the
requested size. In the worst case, we have requested for a vast amount
of MMIO_PREF, and a small amount of MMIO. The kernel tries to assign the
sum of the sizes of the two types in the MMIO window, leaving us with
MMIO_PREF assigned and MMIO unassigned due to the requested size being
too large to assign. In this case, there is nothing we can do about it.
If a massive additional MMIO_PREF and a small non-zero amount of
additional MMIO are needed, then we cannot reduce the requested
MMIO_PREF to work around the bug.

The cause: pbus_size_io() and pbus_size_mem() both call
find_free_bus_resource() to identify which bridge window of the bus
meets the criteria. This function explicitly skips resources with a
parent, inevitably returning NULL. The functions pbus_size_io() and
pbus_size_mem() return -ENOSPC if find_free_bus_resource() returns NULL,
meaning an assigned resource is interpreted as having been failed to
assign, potentially causing it to try to allocate it again in another
window.

The solution: my proposed patch renames find_free_bus_resource() to
find_bus_resource_of_type() and modifies it to not skip resources with
r->parent. The name change is because returning an assigned resource
makes the resource potentially not "free". The calling functions,
pbus_size_io() and pbus_size_mem() have checks added for b_res->parent
and they return 0 (success) in this case. This is because a resource
with a parent is already assigned and should be interpreted as a
success, not a failure.

Testing: I have tested my proposed patch and it appears to work
flawlessly. It has the added benefit of slashing the number of attempts
taken to assign a given BAR, meaning a much cleaner dmesg. In fact, in
my configuration, it has gone from very messy to mainly IO BAR failures.
There is nothing that can be done about IO BARs because of the 16-bit
hardware limitation, meaning the amount of available space cannot be
increased. All that I am left with is a few "failed to assign [mem
size]" very early in boot, before those BARs succeed in the next
attempt. After the small initial set of failures for MEM, there are no
more "failed to assign" whatsoever for non-IO BARs.

Remarks: I fixed some other block comments to comply with kernel code
styling standards. I had to modify the one block comment to reflect the
changes, and checkpatch.pl got really angry and yelled at me. So I
figured I may as well fix the lot.

Signed-off-by: Nicholas Johnson <nicholas.johnson-opensource@xxxxxxxxxxxxxx>
---
drivers/pci/setup-bus.c | 82 +++++++++++++++++++++++++----------------
1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index a8be1a90f..c7e0a5e2b 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -563,17 +563,19 @@ void pci_setup_cardbus(struct pci_bus *bus)
}
EXPORT_SYMBOL(pci_setup_cardbus);

-/* Initialize bridges with base/limit values we have collected.
- PCI-to-PCI Bridge Architecture Specification rev. 1.1 (1998)
- requires that if there is no I/O ports or memory behind the
- bridge, corresponding range must be turned off by writing base
- value greater than limit to the bridge's base/limit registers.
-
- Note: care must be taken when updating I/O base/limit registers
- of bridges which support 32-bit I/O. This update requires two
- config space writes, so it's quite possible that an I/O window of
- the bridge will have some undesirable address (e.g. 0) after the
- first write. Ditto 64-bit prefetchable MMIO. */
+/*
+ * Initialize bridges with base/limit values we have collected.
+ * PCI-to-PCI Bridge Architecture Specification rev. 1.1 (1998)
+ * requires that if there is no I/O ports or memory behind the
+ * bridge, corresponding range must be turned off by writing base
+ * value greater than limit to the bridge's base/limit registers.
+ *
+ * Note: care must be taken when updating I/O base/limit registers
+ * of bridges which support 32-bit I/O. This update requires two
+ * config space writes, so it's quite possible that an I/O window of
+ * the bridge will have some undesirable address (e.g. 0) after the
+ * first write. Ditto 64-bit prefetchable MMIO.
+ */
static void pci_setup_bridge_io(struct pci_dev *bridge)
{
struct resource *res;
@@ -636,9 +638,11 @@ static void pci_setup_bridge_mmio_pref(struct pci_dev *bridge)
struct pci_bus_region region;
u32 l, bu, lu;

- /* Clear out the upper 32 bits of PREF limit.
- If PCI_PREF_BASE_UPPER32 was non-zero, this temporarily
- disables PREF range, which is ok. */
+ /*
+ * Clear out the upper 32 bits of PREF limit.
+ * If PCI_PREF_BASE_UPPER32 was non-zero, this temporarily
+ * disables PREF range, which is ok.
+ */
pci_write_config_dword(bridge, PCI_PREF_LIMIT_UPPER32, 0);

/* Set up PREF base/limit. */
@@ -730,9 +734,11 @@ int pci_claim_bridge_resource(struct pci_dev *bridge, int i)
return -EINVAL;
}

-/* Check whether the bridge supports optional I/O and
- prefetchable memory ranges. If not, the respective
- base/limit registers must be read-only and read as 0. */
+/*
+ * Check whether the bridge supports optional I/O and
+ * prefetchable memory ranges. If not, the respective
+ * base/limit registers must be read-only and read as 0.
+ */
static void pci_bridge_check_ranges(struct pci_bus *bus)
{
u16 io;
@@ -752,9 +758,11 @@ static void pci_bridge_check_ranges(struct pci_bus *bus)
if (io)
b_res[0].flags |= IORESOURCE_IO;

- /* DECchip 21050 pass 2 errata: the bridge may miss an address
- disconnect boundary by one PCI data phase.
- Workaround: do not use prefetching on this device. */
+ /*
+ * DECchip 21050 pass 2 errata: the bridge may miss an address
+ * disconnect boundary by one PCI data phase.
+ * Workaround: do not use prefetching on this device.
+ */
if (bridge->vendor == PCI_VENDOR_ID_DEC && bridge->device == 0x0001)
return;

@@ -789,11 +797,17 @@ static void pci_bridge_check_ranges(struct pci_bus *bus)
}
}

-/* Helper function for sizing routines: find first available
- bus resource of a given type. Note: we intentionally skip
- the bus resources which have already been assigned (that is,
- have non-NULL parent resource). */
-static struct resource *find_free_bus_resource(struct pci_bus *bus,
+/*
+ * Helper function for sizing routines: find first bus resource of a
+ * given type. Note: we do not skip the bus resources which have already
+ * been assigned (r->parent != NULL). This is because a resource that is
+ * already assigned (nothing more to be done) will be indistinguishable
+ * from one that failed due to lack of space if we skip assigned
+ * resources. If the caller function cannot tell the difference then it
+ * might try to place the resources in a different window, doubling up on
+ * resources or causing unforeseeable issues.
+ */
+static struct resource *find_bus_resource_of_type(struct pci_bus *bus,
unsigned long type_mask, unsigned long type)
{
int i;
@@ -802,7 +816,7 @@ static struct resource *find_free_bus_resource(struct pci_bus *bus,
pci_bus_for_each_resource(bus, r, i) {
if (r == &ioport_resource || r == &iomem_resource)
continue;
- if (r && (r->flags & type_mask) == type && !r->parent)
+ if (r && (r->flags & type_mask) == type)
return r;
}
return NULL;
@@ -820,8 +834,10 @@ static resource_size_t calculate_iosize(resource_size_t size,
size = min_size;
if (old_size == 1)
old_size = 0;
- /* To be fixed in 2.5: we should have sort of HAVE_ISA
- flag in the struct pci_bus. */
+ /*
+ * To be fixed in 2.5: we should have sort of HAVE_ISA
+ * flag in the struct pci_bus.
+ */
#if defined(CONFIG_ISA) || defined(CONFIG_EISA)
size = (size & 0xff) + ((size & ~0xffUL) << 2);
#endif
@@ -900,14 +916,16 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
resource_size_t add_size, struct list_head *realloc_head)
{
struct pci_dev *dev;
- struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO,
- IORESOURCE_IO);
+ struct resource *b_res = find_bus_resource_of_type(bus, IORESOURCE_IO,
+ IORESOURCE_IO);
resource_size_t size = 0, size0 = 0, size1 = 0;
resource_size_t children_add_size = 0;
resource_size_t min_align, align;

if (!b_res)
return;
+ if (b_res->parent)
+ return;

min_align = window_alignment(bus, IORESOURCE_IO);
list_for_each_entry(dev, &bus->devices, bus_list) {
@@ -1012,7 +1030,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
resource_size_t min_align, align, size, size0, size1;
resource_size_t aligns[18]; /* Alignments from 1Mb to 128Gb */
int order, max_order;
- struct resource *b_res = find_free_bus_resource(bus,
+ struct resource *b_res = find_bus_resource_of_type(bus,
mask | IORESOURCE_PREFETCH, type);
resource_size_t children_add_size = 0;
resource_size_t children_add_align = 0;
@@ -1020,6 +1038,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,

if (!b_res)
return -ENOSPC;
+ if (b_res->parent)
+ return 0;

memset(aligns, 0, sizeof(aligns));
max_order = 0;
--
2.19.1