On 2022-07-23 2:56 p.m., Lucas De Marchi wrote:
On Fri, Jul 22, 2022 at 09:04:43AM -0400, Liang, Kan wrote:
On 2022-07-22 8:55 a.m., Lucas De Marchi wrote:
Hi Kan,
On Wed, Mar 17, 2021 at 10:59:33AM -0700, kan.liang@xxxxxxxxxxxxxxx wrote:
From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
A self-describing mechanism for the uncore PerfMon hardware has been
introduced with the latest Intel platforms. By reading through an MMIO
page worth of information, perf can 'discover' all the standard uncore
PerfMon registers in a machine.
The discovery mechanism relies on BIOS's support. With a proper BIOS,
a PCI device with the unique capability ID 0x23 can be found on each
die. Perf can retrieve the information of all available uncore PerfMons
from the device via MMIO. The information is composed of one global
discovery table and several unit discovery tables.
- The global discovery table includes global uncore information of the
die, e.g., the address of the global control register, the offset of
the global status register, the number of uncore units, the offset of
unit discovery tables, etc.
- The unit discovery table includes generic uncore unit information,
e.g., the access type, the counter width, the address of counters,
the address of the counter control, the unit ID, the unit type, etc.
The unit is also called "box" in the code.
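As a rough illustration of the "one global table plus several unit tables" layout described above, the sketch below walks a mock discovery page in user space. The struct layouts, field names, and the parse_units() helper are hypothetical simplifications made up for this example; the real tables pack these fields into 64-bit registers as laid out in the uncore document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified layouts. NOT the hardware register format. */
struct global_table {
    uint64_t global_ctl_addr;   /* address of the global control register */
    uint32_t status_offset;     /* offset of the global status register */
    uint32_t num_units;         /* number of uncore units (boxes) */
    uint32_t unit_table_offset; /* offset of the first unit table in the page */
    uint32_t unit_table_stride; /* distance between consecutive unit tables */
};

struct unit_table {
    uint8_t  access_type;
    uint8_t  counter_width;
    uint16_t unit_type;
    uint16_t unit_id;
    uint64_t counter_addr;      /* address of the first counter */
    uint64_t ctl_addr;          /* address of the counter control */
};

/*
 * Read the global table at the start of the page, then copy out each unit
 * table it points at. Stops early if a table would run past the page or if
 * the output array is full. Returns the number of unit tables copied.
 */
unsigned parse_units(const uint8_t *page, size_t page_len,
                     struct unit_table *out, unsigned max_out)
{
    struct global_table g;
    unsigned n = 0;

    assert(page != NULL && out != NULL);
    if (page_len < sizeof(g))
        return 0;
    memcpy(&g, page, sizeof(g));

    for (uint32_t i = 0; i < g.num_units && n < max_out; i++) {
        size_t off = (size_t)g.unit_table_offset +
                     (size_t)i * g.unit_table_stride;

        if (off + sizeof(struct unit_table) > page_len)
            break;
        memcpy(&out[n++], page + off, sizeof(struct unit_table));
    }
    return n;
}
```

The point of the sketch is only the shape of the walk: everything needed to find the unit tables comes from the global table, so a single MMIO page is enough to enumerate the boxes.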
Perf can provide basic uncore support based on this information
with the following patches.
To locate the PCI device with the discovery tables, check the generic
PCI ID first. If it doesn't match, go through the entire PCI device tree
and locate the device with the unique capability ID.
The uncore information is similar among dies. To save parsing time and
space, only completely parse and store the discovery tables on the first
die and the first box of each die. The parsed information is stored in an
RB tree structure, intel_uncore_discovery_type. The size of the stored
discovery tables varies among platforms. It's around 4KB for a Sapphire
Rapids server.
If a BIOS doesn't support the 'discovery' mechanism, the uncore driver
will exit with -ENODEV. Nothing changes in that case.
Add a module parameter to disable the discovery feature. If a BIOS gets
the discovery tables wrong, users can have an option to disable the
feature. For the current patchset, the uncore driver will exit with
-ENODEV. In the future, it may fall back to the hardcoded uncore driver
on a known platform.
Signed-off-by: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>
I observed an issue when upgrading a kernel from 5.10 to 5.15, and
bisecting led to this commit. I also verified that the same issue is
present in 5.19-rc7 and that it is gone when booting with
intel_uncore.uncore_no_discover.
The test system is an SPR host with a PVC gpu. The issue is that PVC is
not reaching the pkg c6 state, even if we put it in the rc6 state. It
seems the pcie link is not idling, preventing it from going to pkg c6.
PMON discovery in bios is set to "auto".
We do see the following on dmesg while going through this code path:
intel_uncore: Invalid Global Discovery State: 0xffffffffffffffff
0xffffffffffffffff 0xffffffffffffffff
On SPR, the uncore driver relies on the discovery table provided by the
BIOS/firmware. It looks like your BIOS/firmware is out of date. Could
you please update to the latest BIOS/firmware and have a try?
hum, the BIOS is up to date. It seems PVC itself has a 0x09a7 device
and it remains in D3, so the 0xffffffffffffffff we see below is
just the auto completion. No wonder the values don't match what we are
expecting here.
Is the device expected to be in D0? Or should we do anything here to
move it to D0 before doing these reads?
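For illustration, the kind of sanity check that produces the "Invalid Global Discovery State" message can be sketched as below. This is a user-space sketch, not the driver's code: the function name is hypothetical, and treating an all-zero word as invalid is an assumption of this example. The all-ones case is the one seen in the dmesg output above, which is what a read typically returns when the device does not respond.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true if any of the three 64-bit words read from the global
 * discovery table looks unusable: all ones (device did not respond)
 * or all zeros (assumed here to mean "not populated").
 */
bool discovery_words_invalid(const uint64_t words[3])
{
    assert(words != NULL);
    for (int i = 0; i < 3; i++) {
        if (words[i] == UINT64_MAX || words[i] == 0)
            return true;
    }
    return false;
}
```

With the three values from the log, all 0xffffffffffffffff, the check reports the table as invalid, so no unit tables would be parsed from that device.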
It's OK to have a 0x09a7 device. But the device should not claim to
support the PMON Discovery if it doesn't comply with the PMON discovery
mechanism.
See 1.10.1 Guidance on Finding PMON Discovery and Reading it in the SPR
uncore document. https://cdrdv2.intel.com/v1/dl/getContent/642245
It demonstrates how the uncore driver finds the device with the PMON
discovery mechanism.
Simply speaking, the uncore driver looks for a DVSEC
structure with a unique capability ID 0x23. Then it checks whether the
PMON discovery entry (0x1) is supported. If both are detected, it means
that the device complies with the PMON discovery mechanism. The uncore
driver will be enabled to parse the discovery table.
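The two checks described above can be sketched in user space against a mock 4KB configuration space. This is only an illustration, not the driver's code: the helper names are hypothetical, the extended capability header layout (ID in bits 15:0, next offset in bits 31:20, list starting at 0x100) follows the PCIe convention, and the 0x8 offset of the entry ID inside the DVSEC is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE   4096
#define EXT_CAP_START    0x100  /* first extended capability header */
#define EXT_CAP_ID_DVSEC 0x23   /* capability ID named in this thread */
#define DVSEC_ID_OFFSET  0x8    /* assumed offset of the entry ID in the DVSEC */
#define DVSEC_ID_MASK    0xffff
#define DVSEC_ID_PMON    0x1    /* PMON discovery entry named in this thread */

/* Little-endian 32-bit read from the mock configuration space. */
uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Little-endian 32-bit write, used only to build mock buffers. */
void cfg_write32(uint8_t *cfg, size_t off, uint32_t val)
{
    cfg[off]     = (uint8_t)(val & 0xff);
    cfg[off + 1] = (uint8_t)((val >> 8) & 0xff);
    cfg[off + 2] = (uint8_t)((val >> 16) & 0xff);
    cfg[off + 3] = (uint8_t)((val >> 24) & 0xff);
}

/*
 * Walk the extended capability list. Return the offset of the first DVSEC
 * whose entry ID is the PMON discovery entry, or 0 if there is none.
 */
uint16_t find_pmon_dvsec(const uint8_t *cfg)
{
    uint16_t pos = EXT_CAP_START;
    int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8; /* guard against loops */

    assert(cfg != NULL);
    while (pos >= EXT_CAP_START && pos <= CFG_SPACE_SIZE - 12 && ttl-- > 0) {
        uint32_t hdr = cfg_read32(cfg, pos);

        /* Empty list, or all ones from a device that does not respond. */
        if (hdr == 0 || hdr == 0xffffffffu)
            break;

        if ((hdr & 0xffff) == EXT_CAP_ID_DVSEC) {
            uint32_t id = cfg_read32(cfg, (size_t)pos + DVSEC_ID_OFFSET);

            if ((id & DVSEC_ID_MASK) == DVSEC_ID_PMON)
                return pos;
        }
        pos = (uint16_t)(hdr >> 20); /* bits 31:20: next capability offset */
    }
    return 0;
}
```

A device that carries a DVSEC with a different entry ID is skipped by this walk, which matches the point above: only a device that advertises both the 0x23 capability and the 0x1 entry is treated as supporting PMON discovery.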
AFAIK, the PVC gpu doesn't support the PMON discovery mechanism. I guess
the firmware of the PVC gpu mistakenly set the PMON discovery entry
(0x1). You may want to check the extended capabilities (DVSEC) in the
PCIe configuration space of the PVC gpu device.
Thanks,
Kan