Re: [PATCH 1/2] PCI: Prevent 64-bit resources from being counted in 32-bit bridge region

From: Logan Gunthorpe
Date: Mon Mar 04 2019 - 14:21:25 EST

On 2019-03-03 5:23 p.m., Bjorn Helgaas wrote:
> Sorry for the delay. This code gives a headache. I still remember
> my headache from the last time we touched it. Help me understand
> what's going on here.

Yes, this code gave me a headache debugging it too. And it's not the
first time I've tried to figure out what's going on with it because it
often just prints noisy messages that look like errors. I think I
understand it better now but it's something that's a bit fleeting and
easy to forget the details of. There may also be other solutions to this

> On Thu, Feb 14, 2019 at 10:00:27AM -0700, Logan Gunthorpe wrote:
>> The possible situations where this can happen will be quite varied and
>> depend highly on the exact hierarchy and how the realloc code ends up
>> trying to assign the regions. It's known to at least require a
>> large 64-bit BAR (>1GB) below a PCI bridge.
> I guess the bug is that some BAR or window is unset when we actually
> have space for it? We need to make this more concrete, e.g., with a
> minimal example of a failure case, and then connect this code change
> specifically with that.

The system we hit this bug on is quite large and complex with multiple
layers of switches though I suspect I might have seen it on a completely
different system but never had time to dig into it. I guess I could try
to find a case in which qemu can hit it.

> "Ignored BARs" doesn't seem like the best terminology here. Can we
> just say they're "unset" as you do for windows? Even that's a little
> squishy because there's really no such thing as a clearly "unset" or
> invalid value for a BAR. All we can say is that Linux *thinks* it's
> unset because it happens to be zero (technically still a valid BAR
> value) or it conflicts with another device.

Yes. I used the "ignored" term because that's the terminology lspci uses
when it sees a resource like this. I'm not sure I like the "unset" term
because, in fact, the BAR registers in the configuration space are
typically still set to whatever the bios set them to[*1]. It's just that
the kernel doesn't create a struct resource for them and thus you wont
see a corresponding entry in /sys/bus/pci/.../resources or
/proc/bus/pci/devices. For example, here's the lspci and hex dump of the
config space for an example case; as you can see Region 0, is "ignored",
but the BAR register is still set to 0xf7100000.

05:00.0 PCI bridge: PLX Technology, Inc. Device 8733 (rev ca) (prog-if
00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 24
NUMA node: 0
Region 0: Memory at <ignored> (32-bit, non-prefetchable)
Bus: primary=05, secondary=06, subordinate=0a, sec-latency=0

05:00.0 PCI bridge: PLX Technology, Inc. Device 8733 (rev ca)
00: b5 10 33 87 07 01 10 00 ca 00 04 06 08 00 81 00
10: 00 00 10 f7 00 00 00 00 05 06 0a 00 11 21 00 00 f7100000
20: f0 ff 00 00 01 00 f1 ff 20 02 00 00 7f 02 00 00

> Strictly speaking, the result is that we can't enable decoding for
> that BAR type. Often that does mean the device is unusable, but in
> some cases, e.g., an I/O BAR being unset and a driver using
> pci_enable_device_mem(), the device *is* usable.

Yes, though I think this bug will always apply to non-prefetchable
32-bit MEM BARs and not I/O BARs. So if the driver needs such a BAR
(which I expect is common) it will not function.

> Surely realloc can fail even without a large 64-bit BAR? I don't
> think there's a magic threshold at 1GB. Maybe an example would
> illustrate the problem better.

No, to hit this bug, we do require a large 64-bit BAR. Though the
threshold isn't necessarily 1GB, its more complicated (see below).

This is just really hard to explain because the code is so tricky. I'll
try to clarify it more below. Once we can clear it up so you understand
it I'll try to update my commit message to be clearer.

> There are actually three calls to pbus_size_mem():
> 1) If bridge has a 64-bit prefetchable window, find the size of all
> 64-bit prefetchable resources below the bridge
> 2) If bridge has no 64-bit prefetchable window, find the size of all
> prefetchable resources below the bridge
> 3) Find the size of everything else (non-prefetchable resources plus
> any prefetchable ones that couldn't be accommodated above)

Yes, this is technically correct. I was just over-simplifying because,
to hit this bug, there must be a 64-bit prefetchable window and there
are no prefetchable 32-bit resources so (2) is irrelevant.

>> There are only two reasons for pbus_size_mem() to fail: if there is no
>> 64-bit/prefetchable bridge window, or if that window is already
>> assigned (in other words, its resource already has a parent set). We know
>> the former case can't be true because, in __pci_bus_size_bridges(), it's
>> existence is checked before making the call. So if the pbus_size_mem()
>> call in question fails, the window must already be assigned, and in this
>> case, we still do not want 64-bit resources trying to be sized into the
>> 32-bit catch-all resource.
> I guess this question of putting a 64-bit resource in the 32-bit
> non-prefetchable window (legal but undesirable) is a secondary thing,
> not the chief complaint you're fixing?

Uh, yeah no. The 64-bit resource(s) are typically correctly assigned to
the 64-bit window. However the bug causes victim 32-bit Non-prefetchable
resources to not be assigned because when we calculate the size the code
to inadvertently thinks the 64-bit resource must fit into the 32-bit
non-prefetchable window.


Ok, so let me try to walk through this a bit more step by step. Lets
make the following assumptions:

1) Let's say we have a bridge that has a 64-bit prefetchable window,
such that (b_res[2].flags & IORESOURCE_MEM_64) is true. So the bridge
has three windows: the I/O window, the non-preftechable 32-bit window
and the prefetchable 64-bit window.

2) Let's also say that, under the bridge, we have a resource that's
larger than we'd expect to fit into the 32-bit window. (So, the actual
limit depends on the maximum size of that window, which is hardware
dependant, and the size of everything else that goes in it, but for
simplicity I estimated this to be at least 1GB). For the purposes of
this example, lets say it's very large: 64GB.

3) Now, the tricky thing is that this code tries to do things in
multiple passes, unassigning resources that didn't seem to fit and
recalculating window sizes on multiple passes. So lets say we are on the
second pass where, previously, the 64-bit prefetchable window on this
bridge was successfully assigned but, for whatever reason, this bridge
failed and the resources were released (see
pci_bus_release_bridge_resources()). In this case (bres[2].parent !=
NULL) and the large 64-bit resource now has (res->parent == NULL).

Walking through __pci_bus_size_bridges():

i) Per (1), we do have a prefetchable window, so the first call to
pbus_size_mem() happens. However, the first thing that function does is
call find_free_bus_resource() which should find bres[2], but does not
because, per (3) its parent is not NULL (see the comment above
find_free_bus_resource() which makes it seem important that parent not
be set). Thus this call to pbus_size_mem() fails with -ENOSPC, and
'type2' and 'type3' remain unset.

ii) Seeing type2 is unset, we go to the second call to pbus_size_mem().
This call fails because per (1), there is no 32-bit prefetchable
resource to find. So find_free_bus_resource() fails, as it should.
'type2' and 'type3' are now set to IORESOURCE_MEM.

iii) Next we do the third call to pbus_size_mem() for the
non-prefetchable window, however, because the first two calls failed, it
will calculate the size for the 32-bit window to be greater than 64GB.

As the code recurses up into the rest of the PCI tree, nothing will fit
into the 32-bit window seeing it's calculated to be much larger than it
needs to be so none of the 32-bit prefetchable BARs will be assigned and
thus will not have struct resources and they will be reported as ignored
by lspci. Drivers that try to use these resources will also fail and
there will be addresses in the IOVA space that may get routed to the
wrong PCI address.

My solution to this bug was to notice that pbus_size_mem() can only fail
for one of two reasons: either the resource doesn't exist at all or it
is already assigned (ie. the resource's parent is still set). However,
for the first call in (i), we know the resource does exist because we
check for it (ie. the condition in (1)). Therefore we can say that if
pbus_size_mem() fails in (i) there is a 64-bit window and we should not
try to size the 32-bit window to fit 64-bit resources. We do this simply
by setting 'mask', 'type2' and 'type3' even when pbus_size_mem() fails.

Another potential solution would be to always unassign the windows for
the failing bridges by setting it's resources parents to NULL. But this
makes me much more nervous because I don't understand what all the
implications would be and the comment above find_free_bus_resource()
makes me think this is important.

Let me know if this makes more sense or you have further questions.

Also, the second patch in this series is a similar bug with *very*
similar symptoms but I think it's easier to understand. It's a
completely different cause and only happens on one particular piece of
hardware that I'm aware of; and the driver for that hardware doesn't use
the BAR at all right now so the only real negative effect is on the IOVA
address space.



[*1] This is probably another big bug due to its affect on the IOVA
address space, but I'm not sure what to do about it and with these
patches we don't have any further issues. However, maybe we should be
scanning the tree after everything is said and done and unset any BARs
that do not have corresponding resources. That way, if there are other
bugs that can cause ignored regions, or if our algorithm does something
that doesn't match the bios there aren't random failures when using the