Re: [PATCH v2 0/7] PCI: Solve two bridge window sizing issues

From: Ilpo Järvinen
Date: Thu Apr 11 2024 - 06:42:21 EST


On Tue, 9 Apr 2024, Jonathan Cameron wrote:
> On Thu, 28 Dec 2023 18:57:00 +0200
> Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> wrote:
>
> > Here's a series that contains two fixes to PCI bridge window sizing
> > algorithm. Together, they should enable remove & rescan cycle to work
> > for a PCI bus that has PCI devices with optional resources and/or
> > disparity in BAR sizes.
> >
> > For the second fix, I chose to expose find_empty_resource_slot() from
> > kernel/resource.c because it should increase accuracy of the cannot-fit
> > decision (currently that function is called find_resource()). In order
> > to do that sensibly, a few improvements seemed in order to make its
> > interface and name of the function sane before exposing it. Thus, the
> > few extra patches on resource side.
> >
> > Unfortunately I don't have a reason to suspect these would help with
> > the issues related to the currently ongoing resource regression
> > thread [1].
> >
> > [1] https://lore.kernel.org/linux-pci/ZXpaNCLiDM+Kv38H@xxxxxxxxxxxxxxxxxxx/
> >
>
> Hi Ilpo
>
> I've hit what looks to be a similar issue to this (not fixed by this series :()
>
> QEMU setup with a CXL PCI Expander bridge a single RP and a 4 port
> port switch with CXL Type 3 devices below ports 0 and 1. (2 and 3 are empty but
> each has a single BAR). RP and DSP on the switch all hotplug capable bridges.
>
> Note that if I touch almost anything about the configuration it all works, but
> this particular combination doesn't. I can add a 3rd empty port or remove 1
> or add or remove an EP below the switch and it all succeeds.
>
> arm64 setup and I'd rather not set the DSM to not reenumerate (because
> there should be no problem doing so.
> Note that arm64 support in general isn't upstream in qemu yet (at least partly
> because we haven't figure out how to do PXB bus enumeration on DT) but can
> be found at gitlab.com/jic23/qemu (not including the vsec additions to expand
> the available CRS entries for the root bridge)
>
> I've added EDK2 support and the vsec structures to expand the windows massively
> so there 'should' be no issue with tight fits. However, the large CRS
> entries for the root bridge don't seem to get picked up.
> I did some digging and the 0c bus has those windows, but not the 0c:00.0
> (the root port). I can't work out how extra space is supposed to get distributed
> to root ports.
>
> Problem chunk with the kernel enumeration is that the first CXL type3 device
> has 3 bars, but the range has been shrunken to the point where only one of them
> fits.

Hi Jonathan,

I'm not entirely sure about the scenario here, because you didn't exactly
explain which steps you used before you the no space for BAR problem
triggers.

The main goal of this patchset is to help in cases where something is
first working as expected, then removed and rescanned, it is not expected
to help on initial enumeration/cases where there is no "first working"
condition (it might help even there, under some scenarios but that's just
"bonus", not the original goal I had in mind).

That being said, there is still caveats even when rescanning. In
multiple children/multilevel cases, the allocation still greedily follows
the current policy and only tries to use the minimal strategy when the
resource can no longer fit. Because the greedy strategy is the default,
the first allocations can consume the space that would be needed later and
all allocations should have been done with minimal strategy but due to
online nature of the resource allocation algorithm it's not able to
correct its mistake at that point.

(The above by no means implies I wouldn't have interest in looking into
and addressing other problems besides what this patchset tries to solve.)

> pci 0000:0f:00.0: BAR 2 [mem 0x20000000-0x2003ffff 64bit]: assigned
> pci 0000:0f:00.0: BAR 0 [mem size 0x00010000 64bit]: can't assign; no space
> pci 0000:0f:00.0: BAR 0 [mem size 0x00010000 64bit]: failed to assign
> pci 0000:0f:00.0: BAR 4 [mem size 0x00001000]: can't assign; no space
> pci 0000:0f:00.0: BAR 4 [mem size 0x00001000]: failed to assign
> pci 0000:0e:00.0: PCI bridge to [bus 0f]
> pci 0000:0e:00.0: bridge window [mem 0x20000000-0x2003ffff]
> pci 0000:0e:00.0: bridge window [mem 0x20c00000-0x20cfffff 64bit pref]
> pci 0000:10:00.0: BAR 2 [mem 0x20100000-0x2013ffff 64bit]: assigned
> pci 0000:10:00.0: BAR 0 [mem 0x20140000-0x2014ffff 64bit]: assigned
> pci 0000:10:00.0: BAR 4 [mem 0x20150000-0x20150fff]: assigned
> pci 0000:0e:01.0: PCI bridge to [bus 10]
> pci 0000:0e:01.0: bridge window [mem 0x20100000-0x2017ffff]
> pci 0000:0e:01.0: bridge window [mem 0x20d00000-0x20dfffff 64bit pref]
>
> AS you can see the window for that first device is simply too small.

The challenge in tracking the allocations is that the sizes are calculated
long before the allocations occurs so it's pretty hard to track things
back into the root cause. And there's also the intermediate step too which
tries to fit the optional (add) sizes.

> EDK2 ends up with a /proc/iomap of
> (kernel hacked as if the DSM was there to prevent reenumeration.
>
> 0b000080-0b0000ff : PRP0001:00
> 10000000-1fffffff : PCI Bus 0000:00
> 10000000-101fffff : PCI Bus 0000:0p1
> 10200000-1023ffff : 0000:00:03.0
> 10240000-10240fff : 0000:00:03.0
> 10241000-10241fff : 0000:00:02.0
> 10242000-10242fff : 0000:00:01.0
> 20000000-381fffff : PCI Bus 0000:0c
> 20000000-2fffffff : PCI Bus 0000:0d
> 20000000-2fffffff : PCI Bus 0000:0e
> 20000000-23ffffff : PCI Bus 0000:12
> 24000000-27ffffff : PCI Bus 0000:11
> 28000000-2bffffff : PCI Bus 0000:10
> 2c000000-2fffffff : PCI Bus 0000:0f
> 30000000-381fffff : PCI Bus 0000:0d
> 30000000-380fffff : PCI Bus 0000:0e
> 30000000-31ffffff : PCI Bus 0000:12
> 32000000-33ffffff : PCI Bus 0000:11
> 34000000-35ffffff : PCI Bus 0000:10
> 34000000-3403ffff : 0000:10:00.0
> 34000080-34000087 : 0000:10:00.0
> 34000088-340008a7 : 0000:10:00.0
> 340008a8-340008af : 0000:10:00.0
> 34010000-34010dff : 0000:10:00.0
> 34020000-34020dff : 0000:10:00.0
> 34040000-3404ffff : 0000:10:00.0
> 34041080-340410d7 : 0000:10:00.0
> 34041128-340411b7 : endpoint4
> 34050000-34050fff : 0000:10:00.0
> 36000000-37ffffff : PCI Bus 0000:0f
> 36000000-3603ffff : 0000:0f:00.0
> 36000080-36000087 : 0000:0f:00.0
> 36000088-360008a7 : 0000:0f:00.0
> 360008a8-360008af : 0000:0f:00.0
> 36010000-36010dff : 0000:0f:00.0
> 36020000-36020dff : 0000:0f:00.0
> 36040000-3604ffff : 0000:0f:00.0
> 36041080-360410d7 : 0000:0f:00.0
> 36041128-360411b7 : endpoint3
> 36050000-36050fff : 0000:0f:00.0
> 38000000-3801ffff : 0000:0e:00.0
> 38001080-380010d7 : mem0
> 38020000-3803ffff : 0000:0e:01.0
> 38021080-380210d7 : mem1
> 38040000-3805ffff : 0000:0e:02.0
> 38060000-3807ffff : 0000:0e:03.0
> 38100000-3811ffff : 0000:0d:00.0
> 38101128-381011b7 : port2
> 38200000-3efeffff : PCI Bus 0000:00
> 40000000-b9d7ffff : System RAM
>
> With the kernel rerunning.
>
> 10000000-1fffffff : PCI Bus 0000:00
> 10000000-101fffff : PCI Bus 0000:01
> 10200000-1023ffff : 0000:00:03.0
> 10240000-10240fff : 0000:00:01.0
> 10241000-10241fff : 0000:00:02.0
> 10242000-10242fff : 0000:00:03.0
> 20000000-381fffff : PCI Bus 0000:0c
> 20000000-20bfffff : PCI Bus 0000:0d
> 20000000-20afffff : PCI Bus 0000:0e
> 20000000-2003ffff : PCI Bus 0000:0f
> 20000000-2003ffff : 0000:0f:00.0
> 20000080-20000087 : 0000:0f:00.0
> 20000088-200008a7 : 0000:0f:00.0
> 200008a8-200008af : 0000:0f:00.0
> 20010000-20010dff : 0000:0f:00.0
> 20020000-20020dff : 0000:0f:00.0
> 20040000-2005ffff : 0000:0e:00.0
> 20060000-2007ffff : 0000:0e:01.0
> 20061080-200610d7 : mem1
> 20080000-2009ffff : 0000:0e:02.0
> 200a0000-200bffff : 0000:0e:03.0
> 20100000-2017ffff : PCI Bus 0000:10
> 20100000-2013ffff : 0000:10:00.0
> 20100080-20100087 : 0000:10:00.0
> 20100088-201008a7 : 0000:10:00.0
> 201008a8-201008af : 0000:10:00.0
> 20110000-20110dff : 0000:10:00.0
> 20120000-20120dff : 0000:10:00.0
> 20140000-2014ffff : 0000:10:00.0
> 20141080-201410d7 : 0000:10:00.0
> 20141128-201411b7 : endpoint3
> 20150000-20150fff : 0000:10:00.0
> 20200000-203fffff : PCI Bus 0000:11
> 20400000-205fffff : PCI Bus 0000:12
> 20b00000-20b1ffff : 0000:0d:00.0
> 20b01128-20b011b7 : port2
> 20c00000-217fffff : PCI Bus 0000:0d
> 20c00000-217fffff : PCI Bus 0000:0e
> 20c00000-20cfffff : PCI Bus 0000:0f
> 20d00000-20dfffff : PCI Bus 0000:10
> 20e00000-20efffff : PCI Bus 0000:11
> 20f00000-20ffffff : PCI Bus 0000:12
>
> All suggestions welcome. I've tried to figure out what is going on but
> beyond thinking that the
> 20000000-381fffff : PCI Bus 0000:0c
> entry isn't being distributed, I'm drawing a blank.

I think it would be generally useful in tracking these problems to have
something actually logged about these resource decisions. E.g., there are
zero pci_dbg()s in pci_bus_distribute_available_resources(). So unless the
window is adjusted, we have zero information on what's going on so no
surprise why everyone is "drawing a blank". :-(

What often comes into play in some scenarios is that once a bridge
window is assigned (res->parent is non-zero), it won't be changed. This
leads to some code not working as per what likely was the original intent.
Also adjust_bridge_window() seems to check for res->parent as the first
thing, maybe check if that prevents distributing the window?

Below is a patch which adds a bit of logging into
pci_bus_distribute_available_resources().

--
i.

From: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>
Subject: [PATCH] PCI: Log bridge window distribute value

pci_bus_distribute_available_resources() is currently very silent, add
debugging print about the distribute decision.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx>

---
drivers/pci/setup-bus.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 909e6a7c3cc3..7a455d75e657 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1917,6 +1917,11 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
normal_bridges);
}

+ pci_dbg(bridge, "distribute to bridges: %u (hp) + %u, io: %llx mem: %llx mem pref: %llx\n",
+ hotplug_bridges, normal_bridges,
+ (unsigned long long)io_per_b, (unsigned long long)mmio_per_b,
+ (unsigned long long)mmio_pref_per_b);
+
for_each_pci_bridge(dev, bus) {
struct resource *res;
struct pci_bus *b;

--
tg: (4cece7649650..) log/distribute (depends on: main)