Re: [PATCH 1/2 v2] PCI/portdrv: Enable reporting inter-switch P2P links

From: Shivasharan Srikanteshwara
Date: Mon Oct 14 2024 - 05:44:13 EST




On Fri, Oct 4, 2024 at 4:09 PM Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
>
> On Thu, 3 Oct 2024 14:41:07 -0600
> Sumanesh Samanta <sumanesh.samanta@xxxxxxxxxxxx> wrote:
>
> > Hi Jonathan,
> >
> > >> Need more data that 'there is a link' for this.
> > >>I'd like to see some info on bandwidth and latency.
> >
> > As you too noted in your comments, for now, we are only addressing p2p
> > connection between "virtual switches", i.e. switches that look
> > different to the host, but are actually part of the same physical
> > hardware.
> > Given that, I am not sure what we should display for bandwidth and
> > latency. There is no physical link to traverse between the virtual
> > switches, and usually we take that as "infinite" bandwidth and "zero"
> > latency.
>
> For a case where you have no information, not having attributes is
> sensible. If there is information (CXL CDAT provides this for switches
> for instance) then we should have an interface that provides space for
> that information.
>
> > As such, any number here will make little sense until we
> > start supporting p2p connection between physical switches.
>
> As above, it makes sense in a switch as well - if the information
> is available.
>
> > We could,
> > of course, have some encoding for the time being, like have "INF" for
> > bandwidth and 0 for latency, but again, those will not be very useful
> > till the day this scheme is extended to physical switch and we display
> > real values, like bandwidth and latency for a x16 PCIe link. Thoughts?
>
> Hide the sysfs attributes for latency and bandwidth if we simply don't
> know.  Software built on top of this can then assume full bandwidth
> is available or better still run some measurements to establish the
> missing data.
>
> All I really meant by this suggestion is a directory with space for
> other info is probably more extensible than a single file.

Hi Jonathan,
We will make the changes to add a directory for p2p_link related information
to be exposed to user. We will only populate the information related to the
inter-switch P2P links. Rest of the attributes can be added for devices that
report them at a later stage.
Please check if the directory structure makes sense to you:
/sys/devices/.../B:D:F/p2p_link/links -> Reading this file will return the same
information that is returned currently by the p2p_link file. 

>
> Jonathan
>
> >
> > sincerely,
> > Sumanesh
> >
> >
> > On Tue, Sep 24, 2024 at 8:57 AM Jonathan Cameron
> > <Jonathan.Cameron@xxxxxxxxxx> wrote:
> > >
> > > On Thu, 19 Sep 2024 01:13:43 -0700
> > > Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx> wrote:
> > >
> > > > Broadcom PCI switches supports inter-switch P2P links between two
> > > > PCI-to-PCI bridges. This presents an optimal data path for data
> > > > movement. The patch exports a new sysfs entry for PCI devices that
> > > > support the inter switch P2P links and reports the B:D:F information
> > > > of the devices that are connected through this inter switch link as
> > > > shown below:
> > > >
> > > >                              Host root bridge
> > > >                 ---------------------------------------
> > > >                 |                                     |
> > > >   NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2
> > > > (2c:00.0)   (2a:00.0)                             (3d:00.0)   (40:00.0)
> > > >                 |                                     |
> > > >                GPU1                                  GPU2
> > > >             (2d:00.0)                             (3f:00.0)
> > > >                                SERVER 1
> > > >
> > > > $ find /sys/ -name "p2p_link" | xargs grep .
> > > > /sys/devices/pci0000:29/0000:29:01.0/0000:2a:00.0/p2p_link:0000:3d:00.0
> > > > /sys/devices/pci0000:3c/0000:3c:01.0/0000:3d:00.0/p2p_link:0000:2a:00.0
> > > >
> > > > Current support is added to report the P2P links available for
> > > > Broadcom switches based on the capability that is reported by the
> > > > upstream bridges through their vendor-specific capability registers.
> > > >
> > > > Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx>
> > > > Signed-off-by: Sumanesh Samanta <sumanesh.samanta@xxxxxxxxxxxx>
> > > > ---
> > > > Changes in v2:
> > > > Integrated the code into PCIe portdrv to create the sysfs entries during
> > > > probe, as suggested by Mani.
> > >
> > > Hmm. So we are trying to rework portdrv in general to support extensibility.
> > > I'm a little nervous about taking in vendor specific code in the meantime
> > > even if it is trivial like this is.  We will be having an extensible
> > > discovery scheme but for now the plan is that will be child device based
> > > for non PCI standard features.
> > >
> > > It is a fairly small bit of code, so maybe it is fine - I'm not keen
> > > on adding the implementation of the vendor specific parts to the
> > > main driver though. Push that to an optional c file.
> > >
> > > A few general comments inline.
> > >
> > > >
> > > >  Documentation/ABI/testing/sysfs-bus-pci |  14 +++
> > > >  drivers/pci/pcie/portdrv.c              | 131 ++++++++++++++++++++++++
> > > >  drivers/pci/pcie/portdrv.h              |  10 ++
> > > >  3 files changed, 155 insertions(+)
> > > >
> > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > > > index ecf47559f495..e5d02f20655f 100644
> > > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > > @@ -572,3 +572,17 @@ Description:
> > > >               enclosure-specific indications "specific0" to "specific7",
> > > >               hence the corresponding led class devices are unavailable if
> > > >               the DSM interface is used.
> > > > +
> > > > +What:                /sys/bus/pci/devices/.../p2p_link
> > > > +Date:                September 2024
> > > > +Contact:     Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx>
> > > > +Description:
> > > > +             This file appears on PCIe upstream ports which supports an
> > > > +             internal P2P link.
> > > > +             Reading this attribute will provide the list of other upstream
> > > > +             ports on the system which have an internal P2P link available
> > > > +             between the two ports.
> > >
> > > Given this only applies to 'internal' links and not inter switch physical links
> > > I think you should rename it.  An eventual p2p link between physical switches
> > > is going to be much more complex as will need routing information.
> > > Let us avoid trampling on that space.
> > >
> > > > +Users:
> > > > +             Userspace applications interested in determining a optimal P2P
> > > > +             link for data transfers between devices connected to the PCIe
> > > > +             switches.
> > >
> > > Need more data that 'there is a link' for this.
> > > I'd like to see some info on bandwidth and latency. I've previously been
> > > in discussions about similar devices that provide a low latency but low
> > > bandwidth direct path.  That gets more likely if we scale up to
> > > multiple physical switches or the software stack is choosing between
> > > multiple p2p targets (e.g. getting nearest path to a multiheaded NIC).
> > >
> > > Perhaps best bet is to leave space for that by using a directory
> > > here to cover everything about p2p?  Maybe have links under there to the
> > > other upstream ports? That might be fiddly to manage though.
> > >
> > > > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > > > index 6af5e0425872..c940b4b242fd 100644
> > > > --- a/drivers/pci/pcie/portdrv.c
> > > > +++ b/drivers/pci/pcie/portdrv.c
> > > > @@ -18,6 +18,7 @@
> > > >  #include <linux/string.h>
> > > >  #include <linux/slab.h>
> > > >  #include <linux/aer.h>
> > > > +#include <linux/bitops.h>
> > > >
> > > >  #include "../pci.h"
> > > >  #include "portdrv.h"
> > > > @@ -37,6 +38,134 @@ struct portdrv_service_data {
> > > >       u32 service;
> > > >  };
> > > >
> > > > +/**
> > > > + * pcie_brcm_is_p2p_supported - Broadcom device specific handler
> > > > + *       to check if the upstream port supports inter switch P2P.
> > > > + *
> > > > + * @dev: PCIe upstream port to check
> > > > + *
> > > > + * This function assumes the PCIe upstream port is a Broadcom
> > > > + * PCIe device.
> > > > + */
> > > > +static bool pcie_brcm_is_p2p_supported(struct pci_dev *dev)
> > >
> > > Put this in a separate c file + use a config option and some
> > > stubs to make it go away if people don't want to support it.
> > > The layering and separation in portdrv is currently messy but
> > > we shouldn't make it worse :(
> > >
Understood. We will move the p2p_link creation code to a separate .c/.h file.
> > > > +{
> > > > +     u64 dsn;
> > > > +     u16 vsec;
> > > > +     u32 vsec_data;
> > > > +
> > > > +     dsn = pci_get_dsn(dev);
> > > > +     if (!dsn) {
> > > > +             pci_dbg(dev, "DSN capability is not present\n");
> > > > +             return false;
> > > > +     }
> > >
> > > Why get the dsn (which will frequently exist on other devices)
> > > before getting the vsec which won't?  Reorder these first
> > > two checks. For most devices the match on vendor will fail in the
> > > vsec check.
> > >
This will be fixed in the next version of this patch.
  
> > > > +
> > > > +     vsec = pci_find_vsec_capability(dev, PCI_VENDOR_ID_LSI_LOGIC,
> > > > +                                     PCIE_BRCM_SW_P2P_VSEC_ID);
> > > > +     if (!vsec) {
> > > > +             pci_dbg(dev, "Failed to get VSEC capability\n");
> > > > +             return false;
> > > > +     }
> > > > +
> > > > +     pci_read_config_dword(dev, vsec + PCIE_BRCM_SW_P2P_MODE_VSEC_OFFSET,
> > > > +                           &vsec_data);
> > > > +
> > > > +     pci_dbg(dev, "Serial Number: 0x%llx VSEC 0x%x\n",
> > > > +             dsn, vsec_data);
> > > > +
> > > > +     if (!PCIE_BRCM_SW_IS_SECURE_PART(dsn))
> > >
> > > Add a comment on this. Not obvious what this is checking as it's picking
> > > a single bit out of a serial number...
> > >
Will do. 
> > > > +             return false;
> > > > +
> > > > +     if (FIELD_GET(PCIE_BRCM_SW_P2P_MODE_MASK, vsec_data) !=
> > > > +             PCIE_BRCM_SW_P2P_MODE_INTER_SW_LINK)
> > > > +             return false;
> > > > +
> > > > +     return true;
> > >         return FIELD_GET(PCIE_BRCM_SW_P2P_MODE_MASK, vsec_data) ==
> > >                PCIE_BRCM_SW_P2P_MODE_INTER_SW_LINK;
> > > perhaps.
> > >
Will take care of this.
> > > > +}
> > > > +
> > > > +/**
> > > > + * Determine if device supports Inter switch P2P links.
> > > > + *
> > > > + * Return value: true if inter switch P2P is supported, return false otherwise.
> > > > + */
> > > > +static bool pcie_port_is_p2p_supported(struct pci_dev *dev)
> > > > +{
> > > > +     /* P2P link attribute is supported on upstream ports only */
> > > > +     if (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM)
> > > > +             return false;
> > > > +
> > > > +     /*
> > > > +      * Currently Broadcom PEX switches are supported.
> > > > +      */
> > > > +     if (dev->vendor == PCI_VENDOR_ID_LSI_LOGIC &&
> > > > +         (dev->device == PCI_DEVICE_ID_BRCM_PEX_89000_HLC ||
> > > > +          dev->device == PCI_DEVICE_ID_BRCM_PEX_89000_LLC))
> > > > +             return pcie_brcm_is_p2p_supported(dev);
> > > > +
> > > > +     return false;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Traverse list of all PCI bridges and find devices that support Inter switch P2P
> > > > + * and have the same serial number to create report the BDF over sysfs.
> > > > + */
> > > > +static ssize_t p2p_link_show(struct device *dev, struct device_attribute *attr,
> > > > +                          char *buf)
> > > > +{
> > > > +     struct pci_dev *pdev = to_pci_dev(dev), *pdev_link = NULL;
> > > > +     size_t len = 0;
> > > > +     u64 dsn, dsn_link;
> > > > +
> > > > +     dsn = pci_get_dsn(pdev);
> > >
> > > Maybe add a comment that we don't need to repeat checks that were done
> > > to make the attribute visible. Hence dsn will exist and it will be p2p link capable.
> > >
Will take care of this.

> > > > +
> > > > +     /* Traverse list of PCI bridges to determine any available P2P links */
> > > > +     while ((pdev_link = pci_get_class(PCI_CLASS_BRIDGE_PCI << 8, pdev_link))
> > >
> > > Feels like we should have a for_each_pci_bridge() similar to for_each_pci_dev()
> > > that does this, but that is already defined to mean something else...
> > >
> > > Bjorn, is this something we should be looking to make more consistent
> > > perhaps with naming to make it clear what is a search of all instances
> > > on any bus and what is a search below a particular bus?
> > >  
> > > > +                     != NULL) {
> > > > +             if (pdev_link == pdev)
> > > > +                     continue;
> > > > +
> > > > +             if (!pcie_port_is_p2p_supported(pdev_link))
> > > > +                     continue;
> > > > +
> > > > +             dsn_link = pci_get_dsn(pdev_link);
> > > > +             if (!dsn_link)
> > > > +                     continue;
> > > > +
> > > > +             if (dsn == dsn_link)
> > > > +                     len += sysfs_emit_at(buf, len, "%04x:%02x:%02x.%d\n",
> > > > +                                          pci_domain_nr(pdev_link->bus),
> > > > +                                          pdev_link->bus->number, PCI_SLOT(pdev_link->devfn),
> > > > +                                          PCI_FUNC(pdev_link->devfn));
> > > > +     }
> > > > +
> > > > +     return len;
> > > > +}
> > >
> > >
> > > > diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> > > > index 12c89ea0313b..1be06cb45665 100644
> > > > --- a/drivers/pci/pcie/portdrv.h
> > > > +++ b/drivers/pci/pcie/portdrv.h
> > > > @@ -25,6 +25,16 @@
> > > >
> > > >  #define PCIE_PORT_DEVICE_MAXSERVICES   5
> > > >
> > > > +/* P2P Link supported device IDs */
> > > > +#define PCI_DEVICE_ID_BRCM_PEX_89000_HLC     0xC030
> > > > +#define PCI_DEVICE_ID_BRCM_PEX_89000_LLC     0xC034
> > > > +
> > > > +#define PCIE_BRCM_SW_P2P_VSEC_ID             0x1
> > > > +#define PCIE_BRCM_SW_P2P_MODE_VSEC_OFFSET    0xC
> > > > +#define PCIE_BRCM_SW_P2P_MODE_MASK           GENMASK(9, 8)
> > > > +#define PCIE_BRCM_SW_P2P_MODE_INTER_SW_LINK  0x2
> > > > +#define PCIE_BRCM_SW_IS_SECURE_PART(dsn)     ((dsn) & 0x8)
> > > Define the mask, and use FIELD_GET() to get that.
Will take care of this.

Best Regards,
Shivasharan

> > > > +
> > > >  extern bool pcie_ports_dpc_native;
> > > >
> > > >  #ifdef CONFIG_PCIEAER
> > >
> >
>

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature