Re: [PATCH 1/2 v2] PCI/portdrv: Enable reporting inter-switch P2P links

From: Jonathan Cameron
Date: Fri Oct 04 2024 - 06:57:46 EST


On Thu, 3 Oct 2024 14:41:07 -0600
Sumanesh Samanta <sumanesh.samanta@xxxxxxxxxxxx> wrote:

> Hi Jonathan,
>
> >> Need more data that 'there is a link' for this.
> >>I'd like to see some info on bandwidth and latency.
>
> As you too noted in your comments, for now, we are only addressing p2p
> connection between "virtual switches", i.e. switches that look
> different to the host, but are actually part of the same physical
> hardware.
> Given that, I am not sure what we should display for bandwidth and
> latency. There is no physical link to traverse between the virtual
> switches, and usually we take that as "infinite" bandwidth and "zero"
> latency.

For a case where you have no information, not having attributes is
sensible. If there is information (CXL CDAT provides this for switches
for instance) then we should have an interface that provides space for
that information.

> As such, any number here will make little sense until we
> start supporting p2p connection between physical switches.

As above, it makes sense in a switch as well - if the information
is available.

> We could,
> of course, have some encoding for the time being, like have "INF" for
> bandwidth and 0 for latency, but again, those will not be very useful
> till the day this scheme is extended to physical switch and we display
> real values, like bandwidth and latency for a x16 PCIe link. Thoughts?

Hide the sysfs attributes for latency and bandwidth if we simply don't
know. Software built on top of this can then assume full bandwidth
is available or better still run some measurements to establish the
missing data.

All I really meant by this suggestion is a directory with space for
other info is probably more extensible than a single file.

Jonathan

>
> sincerely,
> Sumanesh
>
>
> On Tue, Sep 24, 2024 at 8:57 AM Jonathan Cameron
> <Jonathan.Cameron@xxxxxxxxxx> wrote:
> >
> > On Thu, 19 Sep 2024 01:13:43 -0700
> > Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx> wrote:
> >
> > > Broadcom PCI switches supports inter-switch P2P links between two
> > > PCI-to-PCI bridges. This presents an optimal data path for data
> > > movement. The patch exports a new sysfs entry for PCI devices that
> > > support the inter switch P2P links and reports the B:D:F information
> > > of the devices that are connected through this inter switch link as
> > > shown below:
> > >
> > > Host root bridge
> > > ---------------------------------------
> > > | |
> > > NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2
> > > (2c:00.0) (2a:00.0) (3d:00.0) (40:00.0)
> > > | |
> > > GPU1 GPU2
> > > (2d:00.0) (3f:00.0)
> > > SERVER 1
> > >
> > > $ find /sys/ -name "p2p_link" | xargs grep .
> > > /sys/devices/pci0000:29/0000:29:01.0/0000:2a:00.0/p2p_link:0000:3d:00.0
> > > /sys/devices/pci0000:3c/0000:3c:01.0/0000:3d:00.0/p2p_link:0000:2a:00.0
> > >
> > > Current support is added to report the P2P links available for
> > > Broadcom switches based on the capability that is reported by the
> > > upstream bridges through their vendor-specific capability registers.
> > >
> > > Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx>
> > > Signed-off-by: Sumanesh Samanta <sumanesh.samanta@xxxxxxxxxxxx>
> > > ---
> > > Changes in v2:
> > > Integrated the code into PCIe portdrv to create the sysfs entries during
> > > probe, as suggested by Mani.
> >
> > Hmm. So we are trying to rework portdrv in general to support extensibility.
> > I'm a little nervous about taking in vendor specific code in the meantime
> > even if it is trivial like this is. We will be having an extensible
> > discovery scheme but for now the plan is that will be child device based
> > for non PCI standard features.
> >
> > It is a fairly small bit of code, so maybe it is fine - I'm not keen
> > on adding the implementation of the vendor specific parts to the
> > main driver though. Push that to an optional c file.
> >
> > A few general comments inline.
> >
> > >
> > > Documentation/ABI/testing/sysfs-bus-pci | 14 +++
> > > drivers/pci/pcie/portdrv.c | 131 ++++++++++++++++++++++++
> > > drivers/pci/pcie/portdrv.h | 10 ++
> > > 3 files changed, 155 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > index ecf47559f495..e5d02f20655f 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > @@ -572,3 +572,17 @@ Description:
> > > enclosure-specific indications "specific0" to "specific7",
> > > hence the corresponding led class devices are unavailable if
> > > the DSM interface is used.
> > > +
> > > +What: /sys/bus/pci/devices/.../p2p_link
> > > +Date: September 2024
> > > +Contact: Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx>
> > > +Description:
> > > + This file appears on PCIe upstream ports which supports an
> > > + internal P2P link.
> > > + Reading this attribute will provide the list of other upstream
> > > + ports on the system which have an internal P2P link available
> > > + between the two ports.
> >
> > Given this only applies to 'internal' links and not inter switch physical links
> > I think you should rename it. An eventual p2p link between physical switches
> > is going to be much more complex as will need routing information.
> > Let us avoid trampling on that space.
> >
> > > +Users:
> > > + Userspace applications interested in determining a optimal P2P
> > > + link for data transfers between devices connected to the PCIe
> > > + switches.
> >
> > Need more data that 'there is a link' for this.
> > I'd like to see some info on bandwidth and latency. I've previously been
> > in discussions about similar devices that provide a low latency but low
> > bandwidth direct path. That gets more likely if we scale up to
> > multiple physical switches or the software stack is choosing between
> > multiple p2p targets (e.g. getting nearest path to a multiheaded NIC).
> >
> > Perhaps best bet is to leave space for that by using a directory
> > here to cover everything about p2p? Maybe have links under there to the
> > other upstream ports? That might be fiddly to manage though.
> >
> > > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > > index 6af5e0425872..c940b4b242fd 100644
> > > --- a/drivers/pci/pcie/portdrv.c
> > > +++ b/drivers/pci/pcie/portdrv.c
> > > @@ -18,6 +18,7 @@
> > > #include <linux/string.h>
> > > #include <linux/slab.h>
> > > #include <linux/aer.h>
> > > +#include <linux/bitops.h>
> > >
> > > #include "../pci.h"
> > > #include "portdrv.h"
> > > @@ -37,6 +38,134 @@ struct portdrv_service_data {
> > > u32 service;
> > > };
> > >
> > > +/**
> > > + * pcie_brcm_is_p2p_supported - Broadcom device specific handler
> > > + * to check if the upstream port supports inter switch P2P.
> > > + *
> > > + * @dev: PCIe upstream port to check
> > > + *
> > > + * This function assumes the PCIe upstream port is a Broadcom
> > > + * PCIe device.
> > > + */
> > > +static bool pcie_brcm_is_p2p_supported(struct pci_dev *dev)
> >
> > Put this in a separate c file + use a config option and some
> > stubs to make it go away if people don't want to support it.
> > The layering and separation in portdrv is currently messy but
> > we shouldn't make it worse :(
> >
> > > +{
> > > + u64 dsn;
> > > + u16 vsec;
> > > + u32 vsec_data;
> > > +
> > > + dsn = pci_get_dsn(dev);
> > > + if (!dsn) {
> > > + pci_dbg(dev, "DSN capability is not present\n");
> > > + return false;
> > > + }
> >
> > Why get the dsn (which will frequently exist on other devices)
> > before getting the vsec which won't? Reorder these first
> > two checks. For most devices the match on vendor will fail in the
> > vsec check.
> >
> > > +
> > > + vsec = pci_find_vsec_capability(dev, PCI_VENDOR_ID_LSI_LOGIC,
> > > + PCIE_BRCM_SW_P2P_VSEC_ID);
> > > + if (!vsec) {
> > > + pci_dbg(dev, "Failed to get VSEC capability\n");
> > > + return false;
> > > + }
> > > +
> > > + pci_read_config_dword(dev, vsec + PCIE_BRCM_SW_P2P_MODE_VSEC_OFFSET,
> > > + &vsec_data);
> > > +
> > > + pci_dbg(dev, "Serial Number: 0x%llx VSEC 0x%x\n",
> > > + dsn, vsec_data);
> > > +
> > > + if (!PCIE_BRCM_SW_IS_SECURE_PART(dsn))
> >
> > Add a comment on this. Not obvious what this is checking as it's picking
> > a single bit out of a serial number...
> >
> > > + return false;
> > > +
> > > + if (FIELD_GET(PCIE_BRCM_SW_P2P_MODE_MASK, vsec_data) !=
> > > + PCIE_BRCM_SW_P2P_MODE_INTER_SW_LINK)
> > > + return false;
> > > +
> > > + return true;
> > return FIELD_GET(PCIE_BRCM_SW_P2P_MODE_MASK, vsec_data) ==
> > PCIE_BRCM_SW_P2P_MODE_INTER_SW_LINK;
> > perhaps.
> >
> > > +}
> > > +
> > > +/**
> > > + * Determine if device supports Inter switch P2P links.
> > > + *
> > > + * Return value: true if inter switch P2P is supported, return false otherwise.
> > > + */
> > > +static bool pcie_port_is_p2p_supported(struct pci_dev *dev)
> > > +{
> > > + /* P2P link attribute is supported on upstream ports only */
> > > + if (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM)
> > > + return false;
> > > +
> > > + /*
> > > + * Currently Broadcom PEX switches are supported.
> > > + */
> > > + if (dev->vendor == PCI_VENDOR_ID_LSI_LOGIC &&
> > > + (dev->device == PCI_DEVICE_ID_BRCM_PEX_89000_HLC ||
> > > + dev->device == PCI_DEVICE_ID_BRCM_PEX_89000_LLC))
> > > + return pcie_brcm_is_p2p_supported(dev);
> > > +
> > > + return false;
> > > +}
> > > +
> > > +/*
> > > + * Traverse list of all PCI bridges and find devices that support Inter switch P2P
> > > + * and have the same serial number to create report the BDF over sysfs.
> > > + */
> > > +static ssize_t p2p_link_show(struct device *dev, struct device_attribute *attr,
> > > + char *buf)
> > > +{
> > > + struct pci_dev *pdev = to_pci_dev(dev), *pdev_link = NULL;
> > > + size_t len = 0;
> > > + u64 dsn, dsn_link;
> > > +
> > > + dsn = pci_get_dsn(pdev);
> >
> > Maybe add a comment that we don't need to repeat checks that were done
> > to make the attribute visible. Hence dsn will exist and it will be p2p link capable.
> >
> > > +
> > > + /* Traverse list of PCI bridges to determine any available P2P links */
> > > + while ((pdev_link = pci_get_class(PCI_CLASS_BRIDGE_PCI << 8, pdev_link))
> >
> > Feels like we should have a for_each_pci_bridge() similar to for_each_pci_dev()
> > that does this, but that is already defined to mean something else...
> >
> > Bjorn, is this something we should be looking to make more consistent
> > perhaps with naming to make it clear what is a search of all instances
> > on any bus and what is a search below a particular bus?
> >
> > > + != NULL) {
> > > + if (pdev_link == pdev)
> > > + continue;
> > > +
> > > + if (!pcie_port_is_p2p_supported(pdev_link))
> > > + continue;
> > > +
> > > + dsn_link = pci_get_dsn(pdev_link);
> > > + if (!dsn_link)
> > > + continue;
> > > +
> > > + if (dsn == dsn_link)
> > > + len += sysfs_emit_at(buf, len, "%04x:%02x:%02x.%d\n",
> > > + pci_domain_nr(pdev_link->bus),
> > > + pdev_link->bus->number, PCI_SLOT(pdev_link->devfn),
> > > + PCI_FUNC(pdev_link->devfn));
> > > + }
> > > +
> > > + return len;
> > > +}
> >
> >
> > > diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> > > index 12c89ea0313b..1be06cb45665 100644
> > > --- a/drivers/pci/pcie/portdrv.h
> > > +++ b/drivers/pci/pcie/portdrv.h
> > > @@ -25,6 +25,16 @@
> > >
> > > #define PCIE_PORT_DEVICE_MAXSERVICES 5
> > >
> > > +/* P2P Link supported device IDs */
> > > +#define PCI_DEVICE_ID_BRCM_PEX_89000_HLC 0xC030
> > > +#define PCI_DEVICE_ID_BRCM_PEX_89000_LLC 0xC034
> > > +
> > > +#define PCIE_BRCM_SW_P2P_VSEC_ID 0x1
> > > +#define PCIE_BRCM_SW_P2P_MODE_VSEC_OFFSET 0xC
> > > +#define PCIE_BRCM_SW_P2P_MODE_MASK GENMASK(9, 8)
> > > +#define PCIE_BRCM_SW_P2P_MODE_INTER_SW_LINK 0x2
> > > +#define PCIE_BRCM_SW_IS_SECURE_PART(dsn) ((dsn) & 0x8)
> > Define the mask, and use FIELD_GET() to get that.
> > > +
> > > extern bool pcie_ports_dpc_native;
> > >
> > > #ifdef CONFIG_PCIEAER
> >
>