Re: [PATCH] pci: Account for virtual buses in pci_acs_path_enabled

From: Don Dutile
Date: Wed Aug 08 2012 - 15:33:03 EST


On 08/08/2012 11:24 AM, Alex Williamson wrote:
On Tue, 2012-08-07 at 23:00 -0700, Bjorn Helgaas wrote:
On Tue, Aug 7, 2012 at 2:50 PM, Don Dutile <ddutile@xxxxxxxxxx> wrote:
On 08/06/2012 04:47 PM, Bjorn Helgaas wrote:

On Sun, Aug 5, 2012 at 11:55 PM, Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:

On Sun, 2012-08-05 at 23:30 -0600, Bjorn Helgaas wrote:

On Sat, Aug 4, 2012 at 12:19 PM, Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:

It's possible to have buses without an associated bridge
(bus->self == NULL). SR-IOV can generate such buses. When
we find these, skip to the parent bus to look for the next
ACS test.
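
(For illustration -- a simplified sketch of the fixed walk, not the
literal patch body; pci_acs_enabled() is the per-device test from the
same series:)

/*
 * Sketch only: when a bus has no associated bridge (an SR-IOV virtual
 * bus), climb to the parent bus instead of dereferencing a NULL
 * bus->self.
 */
bool pci_acs_path_enabled(struct pci_dev *start,
			  struct pci_dev *end, u16 acs_flags)
{
	struct pci_dev *pdev = start;

	while (1) {
		struct pci_bus *bus = pdev->bus;

		if (!pci_acs_enabled(pdev, acs_flags))
			return false;

		if (pdev == end)
			return true;

		/* Skip virtual buses (bus->self == NULL) */
		while (!pci_is_root_bus(bus) && !bus->self)
			bus = bus->parent;

		if (pci_is_root_bus(bus))
			return !end;	/* hit the root: ok iff no end given */

		pdev = bus->self;
	}
}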


To make sure I understand the problem here, I think you're referring
to the situation where an SR-IOV device can span several bus numbers,
e.g., the "VFs Spanning Multiple Bus Numbers" implementation note in
the SR-IOV 1.1 spec, sec. 2.1.2.

It says "All PFs must be located on the Device's captured Bus Number"
-- I think that means every PF will be directly on a bridge's
secondary bus and hence will have a valid dev->bus->self pointer.

However, VFs need not be on the same bus number. If a VF is on
(captured Bus Number plus 1), I think we allocate a new struct pci_bus
for it, but there's no P2P bridge that leads to that bus, so the
bus->self pointer is probably NULL.


Yes, exactly. virtfn_add_bus() is where we're creating this new bus.
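
(Roughly -- simplified from what iov.c does, not the exact source: if
the VF's bus number is past the PF's, look up or create a child pci_bus
with no bridge device behind it, which is where the NULL bus->self
comes from:)

static struct pci_bus *virtfn_bus_sketch(struct pci_bus *bus, int busnr)
{
	struct pci_bus *child;

	if (bus->number == busnr)
		return bus;

	child = pci_find_bus(pci_domain_nr(bus), busnr);
	if (child)
		return child;

	/* No bridge device to hand in, so child->self stays NULL */
	return pci_add_new_bus(bus, NULL, busnr);
}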

This makes me quite nervous, because I bet there are many places that
assume every non-root bus has a valid bus->self pointer -- I know I
certainly had that assumption.

I looked at callers of pci_is_root_bus(), and at first glance, it seems
like
iommu_init_device(), intel_iommu_add_device(), pci_acs_path_enabled(),



These 3 are handled by this patch, plus the intel and amd iommu patches
I sent.

pci_get_interrupt_pin(), pci_common_swizzle(),


If SR-IOV is the only source of these virtual buses, these are probably
OK, since VFs don't support INTx.

pci_find_upstream_pcie_bridge(), and


Here the pci_is_root_bus() check comes after a pci_is_pcie() check, so
again, if SR-IOV is the only source (and assuming VFs properly report
the PCIe capability), we shouldn't stumble on it.

pci_bus_release_bridge_resources() all might have similar problems.


This one might deserve further investigation. Thanks,


We can fix all these places piecemeal, but that doesn't feel like a
very satisfying solution. It makes it much harder to know that each
place is correct, and this oddity of a bus with no upstream bridge is
still lying around, waiting to bite us again later.

What other possible ways of fixing this do we have? Could we set
bus->self (multiple buses would then point to the same bridge, and I
don't know if that would break something)? Add something like a
pci_upstream_p2p_bridge() interface that would encapsulate traversing

^^^ and this name will reduce the confusion? :)

I don't claim that :) I just wanted to explore other possible
solutions. Changing every loop that searches the parent chain so it
knows about this SR-IOV oddity doesn't seem like the ideal solution,
though maybe it's the best we can do given the constraints.

Yep, these are the two alternatives I thought of too, set bus->self or
create a helper function. The former leaves present assumptions in
place, but I don't know whether we'll break something else having two
buses claiming to be sourced by the same device. Maintaining the NULL
self pointer almost feels like a better representation of the hardware,
but requires evaluating all the current users and coming up with a
helper. That's more work, but we probably want to move away from
everyone manually walking pci data structures anyway.
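
(Such a helper might look something like this sketch -- hypothetical,
using the name Bjorn floated above:)

/*
 * Hypothetical: return the upstream P2P bridge, transparently skipping
 * bridge-less SR-IOV virtual buses.
 */
static inline struct pci_dev *pci_upstream_p2p_bridge(struct pci_dev *dev)
{
	struct pci_bus *bus = dev->bus;

	while (bus && !bus->self)	/* virtual bus (or root): climb */
		bus = bus->parent;

	return bus ? bus->self : NULL;	/* NULL once we reach the root */
}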

Since these fake VF buses don't have a bridge that points to them, I

Well, they aren't fake buses, just ARI identifiers, which translate the
B:D.F (8:5:3-bit) format to simply a 16-bit id.

I think an SR-IOV device can consume multiple bus numbers even without
ARI (in fact, I think ARI reduces the number of bus numbers the
device requires ... e.g., a PF and 15 VFs would require two bus
numbers without ARI (04:00.0 - 04:00.7 and 05:00.0 - 05:00.7) but only
one bus number with ARI (04:00.0 - 04:01.7)). (I think "04:01.7" is
how Linux would represent the 8-bit function number ARI gives you.
You could also think of it as "04:00.0f")
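
(To make the notation concrete -- with ARI the device/function split
goes away and devfn becomes a flat 8-bit function number, so ARI
function 15 renders as 01.7 through the usual macros; small standalone
example:)

/* ARI function 15 on bus 04 is devfn 0x0f -> "04:01.7" via the
 * standard PCI_SLOT()/PCI_FUNC() macros, or "04:00.0f" viewed flat. */
#include <stdio.h>

#define PCI_SLOT(devfn)	(((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn)	((devfn) & 0x07)

int main(void)
{
	unsigned int devfn = 15;	/* flat 8-bit ARI function number */

	printf("%02x:%02x.%x\n", 0x04, PCI_SLOT(devfn), PCI_FUNC(devfn));
	return 0;
}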

Yep, that's what I think is happening too. I've been testing on a
system with ARI, so using the same device as David, PFs at 1:00.0/1 and
VFs at 1:10.0 - 1:11.5. The sr-iov capability reports VF offset: 128,
stride: 2. IIRC, w/o ARI this same device reports VF offset 384 (256 +
128), which seems to match David's lspci.
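
(The spec's Routing ID arithmetic makes both cases easy to check -- VF
n sits at PF_RID + First VF Offset + (n - 1) * VF Stride; plugging in
the numbers above:)

/* SR-IOV Routing ID arithmetic, checked against the numbers above:
 * PF at 01:00.0, offset 128/stride 2 with ARI, offset 384 without. */
#include <stdio.h>

static unsigned int vf_rid(unsigned int pf_rid, unsigned int offset,
			   unsigned int stride, unsigned int n)
{
	return (pf_rid + offset + (n - 1) * stride) & 0xffff;
}

static void show(const char *label, unsigned int rid)
{
	printf("%s: %02x:%02x.%x\n", label,
	       rid >> 8, (rid >> 3) & 0x1f, rid & 0x7);
}

int main(void)
{
	unsigned int pf = 0x01 << 8;	/* PF Routing ID for 01:00.0 */

	show("VF1 w/ ARI ", vf_rid(pf, 128, 2, 1));	/* 01:10.0 */
	show("VF1 w/o ARI", vf_rid(pf, 384, 2, 1));	/* 02:10.0 */
	return 0;
}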


Correct, lack of ARI forces the use of more bus numbers, not fewer.
It is 'fixed' by the iov.c code by increasing the subordinate bus number
register to cover the range of buses the VF stride wants to use, if those
bus numbers aren't already consumed; if they are consumed, the VFs are
inaccessible/unconfigurable.


So, VF devices should be attached to the same bus->devices list as their PF.

I don't think it works that way today, does it? In the SR-IOV spec
example in sec 2.1.2:

PF 0 at 04:00.0
ARI Capable is set
First VF Offset = 1, VF Stride = 1, NumVFs = 600

I think we have three separate bus->devices lists:

pci_bus 04: devices list contains PF 0 and VF 0,1 through VF 0,255
pci_bus 05: devices list contains VF 0,256 - VF 0,511
pci_bus 06: devices list contains VF 0,512 - VF 0,600
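
(So anything that wants "all functions of this device" today has to
visit three sibling struct pci_bus lists -- hypothetical illustration:)

/*
 * Hypothetical: enumerate every function of the example device by
 * walking each bus->devices list from the PF's bus through the last
 * bus number the VFs consume.
 */
static void walk_device_functions(struct pci_bus *pf_bus, int last_busnr)
{
	int busnr;

	for (busnr = pf_bus->number; busnr <= last_busnr; busnr++) {
		struct pci_bus *bus = pci_find_bus(pci_domain_nr(pf_bus),
						   busnr);
		struct pci_dev *dev;

		if (!bus)
			continue;
		list_for_each_entry(dev, &bus->devices, bus_list)
			pr_info("%s\n", pci_name(dev));
	}
}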

The VF's pci_dev->bus should be the same bus ptr as the PF's pci_dev as
well, since the VF uses all of that bus's resources and support functions
(cfg, dma-ops, etc.).
Searching the drivers/pci area, support functions like AER want the
bus struct that's receiving/handling the PCIe error, the associated (hw)
port, etc., so that's another reason the VF's pci_dev bus ptr should be
the same as the PF's.

Maybe every VF *should* have the same dev->bus pointer as the PF, but
I don't think it does today. I think we only store the bus number in
the struct pci_bus, so if we *did* give all the VFs the same dev->bus
pointer and put all the VFs in the same bus->devices list, we'd have
to store the bus number elsewhere, e.g., in the struct pci_dev.

That might make sense, but the magnitude of a change like that makes
my head hurt -- it would affect drivers, arch code, config accessors,
etc.

Yep, I agree, that's a total rework of what a struct pci_bus represents.
Thanks,

Alex


To sum up, I believe the sw structures should reflect the hw connectivity
and association. If you buy that bridge (all pun intended), then the VFs should
get connected to the same pci_bus as the PF, because that's where they
physically are, and the register set for that parent bridge is the one the PF
is connected to...
bus->self == NULL will cause heartburn for AERs associated with VFs: no dev to
point to in order to do AER handling... :(
Likewise, additional bus structs could be generated, but they would require
ptrs in them to ensure they point to the same structures (same dev-lists, same
resource structs, etc.) and/or inherit the PF's fields/attributes.
Changes to a pci_dev associated with a bridge would also have to check whether
'peer' (VF/SR-IOV) buses are associated with that bridge/pci_dev, and duplicate
the updates -- which begs for the same bus struct to be used.

I think we have to break the identification of devices by B:D.F (VFs
specifically) and the existing 1-to-1 relationship with a hw device. Helper
functions to map a VF's B:D.F onto the PF's bus number (with a D value
potentially greater than 31) would be one way to do it.
Changing devfn in the pci_dev struct from u8 to u16 would be one option;
putting the entire B:D.F into pci_dev and making it a u32 would probably be
the cleanest (and I'm familiar with at least one OS & its PCI developer that
did just that ;-), though not for these ARI reasons), and then macros like
PCI_FUNC() and PCI_SLOT() could be changed appropriately, along with a new
PCI_BUS() macro.
The only serious gotcha I see in the B:D.F reinterpretation is how it is
presented in sysfs, since a D value greater than 31 could cause some heartburn
for user apps traversing sysfs, and how those values would get
copied/transposed to other interfaces, e.g., libvirt scanning sysfs for VFs,
and how the B:D.F is parsed and passed down to qemu when a device is assigned
to a guest. The 32-bit B:D.F in pci_dev, with new helpers to present & parse
based on that field, is looking better to me now...
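
(A sketch of that last option -- purely hypothetical, none of these
names exist in the kernel today:)

/*
 * Hypothetical "entire B:D.F as a u32 in pci_dev" layout:
 * [31:16] domain, [15:8] bus, [7:0] devfn.
 */
#define BDF_DOMAIN(bdf)	(((bdf) >> 16) & 0xffff)
#define BDF_BUS(bdf)	(((bdf) >> 8) & 0xff)
#define BDF_DEVFN(bdf)	((bdf) & 0xff)
#define BDF_SLOT(bdf)	(BDF_DEVFN(bdf) >> 3)	/* classic 5-bit device */
#define BDF_FUNC(bdf)	(BDF_DEVFN(bdf) & 0x7)	/* classic 3-bit function */
#define BDF_ARI_FN(bdf)	BDF_DEVFN(bdf)		/* flat 8-bit ARI function */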

As for PCI tree traversal, endpoints shouldn't be doing it; I'll bet the
majority that do are just looking for the parent bus to get/print the bus
number. That leaves PCI core code for bus scanning, hot plug, ACS tree
traversal, etc., which should be manageable in a change like the ones I
mention in the previous paragraph.
Alex: does VFIO use pci functions to do traversal, or does it implement its
own functions to traverse pci trees?

So, as stated previously, this certainly hurts the brain cells, but not as much
as I thought it would.
Someday, I'll try to figure out how MR-IOV-based buses come & go in the pci
dev tree! :-o

- Don

