Re: [PATCH] Add support for reading SMBIOS Slot number for PCI devices

From: Jordan Hargrave
Date: Tue Nov 10 2015 - 09:40:26 EST


On Thu, Jul 23, 2015 at 10:11 PM, Jordan Hargrave <jharg93@xxxxxxxxx> wrote:
> On Thu, Jul 23, 2015 at 9:31 PM, Jordan Hargrave <jharg93@xxxxxxxxx> wrote:
>> On Thu, Jul 23, 2015 at 12:24 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>>> [+cc Matthew for PCIe-SSD perspective]
>>>
>>> On Wed, Jul 22, 2015 at 03:07:46PM -0500, Jordan Hargrave wrote:
>>>> On Tue, Jul 21, 2015 at 8:09 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>>>> > On Tue, Jul 21, 2015 at 12:31:35PM -0500, Jordan Hargrave wrote:
>>>> >> On Tue, Jul 21, 2015 at 11:57 AM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>>>> >> > On Mon, Jul 13, 2015 at 09:57:32AM -0500, Jordan Hargrave wrote:
>>>> >> >> On Mon, Jul 13, 2015 at 2:35 AM, Jean Delvare <jdelvare@xxxxxxx> wrote:
>>>> >> >>
>>>> >> >> > Hi Jordan,
>>>> >> >> >
>>>> >> >> > On Fri, 10 Jul 2015 17:02:46 -0500, Jordan Hargrave wrote:
>>>> >> >> > > From: Jordan Hargrave <Jordan_Hargrave@xxxxxxxx>
>>>> >> >> > >
>>>> >> >> > > There currently isn't an easy way to determine which PCI devices belong
>>>> >> >> > to
>>>> >> >> > > system slots. This patch adds support to read SMBIOS Type 9 (System
>>>> >> >> > Slots).
>>>> >> >> >
>>>> >> >> > I'm wondering, can't you use dmidecode or libsmbios to retrieve the
>>>> >> >> > same information?
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > Jean Delvare
>>>> >> >> > SUSE L3 Support
>>>> >> >> >
>>>> >> >>
>>>> >> >> You can but it's as not easy to determine the slot number for leaf devices
>>>> >> >> on bridges. Eventually planning on using this for pulling slot number for
>>>> >> >> identifying network cards and disk numbering for systemd
>>>> >> >
>>>> >> > Can you outline the problems with using dmidecode or libsmbios?
>>>> >>
>>>> >> Neither dmidecode nor libsmbios report the slot number for devices
>>>> >> behind bridges in a slot.
>>>> >
>>>> > True, but it's straightforward to walk up the PCI tree in sysfs, e.g.,
>>>> > given a path like /sys/devices/pci0000:00/0000:00:1c.5/0000:03:00.0/, it's
>>>> > easy to see what the upstream bridges are.
>>>> >
>>>> It makes it more complicated I think. You have to check all functions
>>>> on all devices as well on the walk to the root.
>>>>
>>>> Would it look something like this?
>>>> while (pdev) {
>>>> if (pdev->bus->number == dslot->bus && PCI_SLOT(pdev->devfn) ==
>>>> PCI_SLOT(dslot->devfn) &&
>>>> (pci_pcie_type(pdev) != PCI_ROOT_PORT || PCI_FUNC(pdev->devfn) ==
>>>> PCI_FUNC(dslot->devfn))
>>>> return dslot->instance;
>>>> if (!pdev->bus->parent)
>>>> break;
>>>> pdev = pdev->bus->parent->self;
>>>> }
>>>>
>>>> >> I'm wanting to use this sysfs variable to
>>>> >> get slot numbers for systemd, so using libsmbios and dmidecode aren't
>>>> >> very useful.
>>>> >
>>>> > If you want this in systemd, I see why you wouldn't want a command like
>>>> > dmidecode. Help me understand the problem with libsmbios. Is it not
>>>> > useful because (a) systemd doesn't want to link with it, or (b) libsmbios
>>>> > doesn't have the right information, or (c) something else?
>>>> >
>>>> Linking with libsmbios would be a problem, and libsmbios doesn't have
>>>> this info anyway.
>>>>
>>>> > It doesn't *look* like this is using any information that is only available
>>>> > in the kernel, so in principle it seems like this could be done in
>>>> > user-space.
>>>> >
>>>>
>>>> I actually just got an email from someone who needs to determine the
>>>> card slot number in their driver... so I've added an external callable
>>>> 'pci_get_smbios_slot' function to enable this.
>>>>
>>>> >> We already report the index for embedded devices in
>>>> >> pci-label.c, this code should have gone in at the same time.
>>>> >>
>>>> >> For example. The SMBIOS entry for slot 3 is 40:00.0 There is a
>>>> >> quad-port NIC in the slot with a bridge at 40:00.0
>>>> >>
>>>> >> 42:00.0 Bridge (sec=43, sub=45)
>>>> >> 43:02.0 Bridge (sec=44, sub=44)
>>>> >> 43:04.0 Bridge (sec=45, sub=45)
>>>> >> 44:00.0 Ethernet
>>>> >> 44:00.1 Ethernet
>>>> >> 45:00.0 Ethernet
>>>> >> 45:00.1 Ethernet
>>>> >>
>>>> >> So dmidecode only returns the slot number for 42:00.0 but not any
>>>> >> child devices. This code will provide a 'slot' sysfs variable that
>>>> >> reports '3' for all devices under and including the bridge.
>>>> >
>>>> > What if the card in slot 3 is an adapter leading to an external PCI
>>>> > chassis? Wouldn't we then have a 'slot' file for every card in that
>>>> > chassis, all containing '3'? This sounds confusing, although it is true
>>>> > that they all would be connected via the system board slot 3.
>>>> >
>>>> Yes that is correct. Unless SMBIOS had a table of the second chassis.
>>>>
>>>> > Also, we do have the /sys/bus/pci/slots/ hierarchy already. If we do put
>>>> > something like this in the kernel, how would it relate to that hierarchy?
>>>> > Could this SMBIOS stuff be integrated into that somehow?
>>>> >
>>>>
>>>> /sys/bus/pci/slots only map a slot to a single PCI device. systemd
>>>> does use /sys/bus/pci/slots but it can't see the slot number on cards
>>>> with bridges as the actual slot number is a parent. And that's not
>>>> easily to determine a parent device from the slots interface. I'd
>>>> really like something generic here. I'm also looking for some method
>>>> for reporting bay/enclosure ID for PCIe-SSD devices.
>>>>
>>>> > We have a bit of a hodge-podge of slot names already, and I'd like to
>>>> > simplify things if we can.
>>>
>>> We aren't really converging here yet.
>>>
>>> We need to figure out exactly what has to be done in the kernel because it
>>> can't be done in user-space. So far, your patch *looks* like it could (in
>>> principle) be done in user-space, given a sysfs PCI hierarchy and the
>>> SMBIOS information.
>>
>> It's a pain to do in user space as you have to read DMI information,
>> then traverse PCI hierarchy until you find a match. Every utility or
>> code that wants to match a PCI device to a SMBIOS slot must reinvent
>> the wheel. biosdevname has its own code to read the entire SMBIOS
>> table. It then reads in all pci devices and attempts to do a mapping.
>> Why do this when the kernel already has this info easily available.
>>
>>>
>>> The goal ("determine which PCI devices belong to system slots") is
>>> definitely generic and useful. We have quite a bit of slot stuff already
>>> in the kernel, and some is already exposed via sysfs, so if we're still
>>> missing what you need, it seems like the current code is off the rails
>>> somewhere and should be fixed.
>>>
>>> Can we explore what you need in a little more detail, with some concrete
>>> examples? I heard:
>>>
>>> - Systemd uses /sys/bus/pci/slots but gets the wrong slot information for
>>> cards with bridges on them. Can you give an example including all the
>>> PCI devices involved and the related /sys/bus/pci/slots info so we can
>>> see exactly what's wrong?
>>
>> I created a quick and dirty module that populates /sys/bus/pci/slots:
>>
>> #include <linux/kernel.h>
>> #include <linux/module.h>
>> #include <linux/pci.h>
>> #include <linux/dmi.h>
>>
>> static int __init kslot_init(void)
>> {
>> const struct dmi_device *dmi;
>> struct dmi_dev_onboard *dslot;
>> struct pci_dev *sdev;
>>
>> dmi = NULL;
>> while ((dmi = dmi_find_device(DMI_DEV_TYPE_SYSTEM_SLOT, NULL,
>> dmi)) != NULL) {
>> dslot = dmi->device_data;
>> sdev = pci_get_domain_bus_and_slot(dslot->segment, dslot->bus,
>> dslot->devfn);
>> if (!sdev)
>> continue;
>> pci_create_slot(sdev->bus, dslot->instance, dslot->dev.name,
>> NULL);
>> pci_dev_put(sdev);
>> }
>> return 0;
>> }
>>
>> static void __exit kslot_exit(void)
>> {
>> }
>>
>> module_init(kslot_init);
>> module_exit(kslot_exit);
>>
>> MODULE_AUTHOR("jordan_hargrave@xxxxxxxx");
>> MODULE_LICENSE("GPL");
>> ===
>>
>> # ls -l /sys/bus/pci/slots
>> total 0
>> drwxr-xr-x 2 root root 0 Jul 23 21:43 PCI1
>> drwxr-xr-x 2 root root 0 Jul 23 21:43 PCI2
>> drwxr-xr-x 2 root root 0 Jul 23 21:43 PCI3
>>
>> # for X in /sys/bus/pci/slots/* ; do echo -n "$X = "; cat $X/address; done
>> /sys/bus/pci/slots/PCI1 = 0000:41:01
>> /sys/bus/pci/slots/PCI2 = 0000:42:02
>> /sys/bus/pci/slots/PCI3 = 0000:04:03
>>
>> So one problem with this already is the string that gets output. It's
>> <domain>:<bus>:<slot> so you still have the problem of what about
>> bus 43-45 that are children of bus 42.
>>
>> So wrote a script:
>> #!/usr/bin/bash
>> for X in /sys/bus/pci/devices/* ; do
>> RX=$(readlink -f $X)
>> for Z in /sys/bus/pci/slots/* ; do
>> NAME=$(basename $Z)
>> ADDR=$(cat $Z/address | cut -f1-2 -d:)
>> if [[ $RX =~ $ADDR ]] ; then
>> echo $RX,$NAME
>> fi
>> done
>> done
>>
>> for each PCI device you have to iterate all devices in
>> /sys/bus/pci/slots, and munge the 'address' to actually find a match
>>
>> output is:
>> /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0,PCI3 [SMBIOS entry]
>> /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.1,PCI3
>> /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.2,PCI3
>> /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.3,PCI3
>> /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.4,PCI3
>> /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.5,PCI3
>> /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.6,PCI3
>> /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.7,PCI3
>> /sys/devices/pci0000:40/0000:40:01.0/0000:41:00.0,PCI1 [SMBIOS entry]
>> /sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0,PCI2 [SMBIOS entry]
>> /sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:02.0,PCI2
>> /sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:04.0,PCI2
>> /sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:02.0/0000:44:00.0,PCI2
>> /sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:02.0/0000:44:00.1,PCI2
>> /sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:04.0/0000:45:00.0,PCI2
>> /sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:04.0/0000:45:00.1,PCI2
>>
>> And I'm not totally convinced this will work on some weird
>> configurations I've seen of SMBIOS entries. as in the B:D:F must
>> match, not just the bus number.
>
> Yeah just verified this won't work on this config.
>
> I have a lspci output that defines the following bridges
> (segment:bus:dev:func -> secondary:subordinate)
> 0000:00:01.0 -> 01:01
> 0000:00:02.0 -> 02:02
> 0000:00:03.0 -> 03:03
> 0000:00:03.2 -> 04:04
> 0000:00:1c.0 -> 05:05
> 0000:00:1c.4 -> 06:06
>
> The SMBIOS table defines the following entries:
> Descriptor: MEZZ1 ID:1 0000:00:03.0
> Descriptor: MEZZ2 ID:2 0000:00:03.2
>
> creating /sys/bus/pci/slots:
> # ls -l /sys/bus/pci/slots/
> total 0
> drwxr-xr-x 2 root root 0 Jul 23 23:04 MEZZ1
> drwxr-xr-x 2 root root 0 Jul 23 23:04 MEZZ2
>
> # for X in /sys/bus/pci/slots/* ; do echo -n "$X = "; cat $X/address; done
> /sys/bus/pci/slots/MEZZ1 = 0000:00:01
> /sys/bus/pci/slots/MEZZ2 = 0000:00:02
>
> Both entries are showing Bus 0 only... not 0:3.0 and 0:3.2
>>
>>>
>>> - A driver needs the card slot number. A slot number seems like a user
>>> interface thing, so I'm surprised that a driver would be concerned with
>>> it. And of course, SMBIOS is an arch-specific thing, so the driver
>>> would have to be able to get along without the slot number anyway.
>>>
>> pci_get_smbios_slot would just return -ENODEV or something on systems
>> without SMBIOS.
>>
>>> - PCIe-SSD bay/enclosure IDs. Where would these IDs come from, and what
>>> sort of reporting mechanism are you looking for? How are these
>>> structured? Is there a separate PCIe device for every bay/enclosure
>>> ID, so there would be a 1:1 mapping from PCIe device to ID?
>>>
>> On our system the enclosure IDs actually come from IPMI commands.
>> However you must query every PCI Bus number in the system to get the
>> correct enclosure mapping. Currently yes there is a 1:1 mapping.
>>
>>> Bjorn

The systemd maintainers are still refusing to add in code to parse
SMBIOS directly, so we need for this to be in the kernel as a sysfs
variable. Perferrably per-PCI device instead of /sys/bus/pci/XXX/slots
as that only gets populated for the upstream device if a card has one
or more levels of bridges.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/