Re: [RFC PATCH] hwmon: (peci/cputemp) Number cores as seen by host system

From: Zev Weiss
Date: Thu Feb 09 2023 - 20:48:52 EST


On Thu, Feb 09, 2023 at 04:26:47PM PST, Guenter Roeck wrote:
On 2/9/23 16:14, Zev Weiss wrote:
On Thu, Feb 09, 2023 at 09:50:01AM PST, Guenter Roeck wrote:
On Wed, Feb 08, 2023 at 05:16:32PM -0800, Zev Weiss wrote:
While porting OpenBMC to a new platform with a Xeon Gold 6314U CPU
(Ice Lake, 32 cores), I discovered that the core numbering used by the
PECI interface appears to correspond to the cores that are present in
the physical silicon, rather than those that are actually enabled and
usable by the host OS (i.e. it includes cores that the chip was
manufactured with but later had fused off).

Thus far the cputemp driver has transparently exposed that numbering
to userspace in its 'tempX_label' sysfs files, so the core numbers
it reports do not align with the core numbering used by the host system,
which seems like an unfortunate source of confusion.

We can instead use a separate counter to label the cores in a
contiguous fashion (0 through numcores-1) so that the core numbering
reported by the PECI cputemp driver matches the numbering seen by the
host.


I don't really have an opinion if this change is desirable or not.
I suspect one could argue either way. I'll definitely want to see
feedback from others. Any comments or thoughts, anyone ?


Agreed, I'd definitely like to get some input from Intel folks on this.

Though since I realize my initial email didn't quite explain this explicitly, I should probably clarify with an example how weird the numbering can get with the existing code -- on the 32-core CPU I'm working with at the moment, the tempX_label files produce the following core numbers:

    Core 0
    Core 1
    Core 2
    Core 3
    Core 4
    Core 5
    Core 6
    Core 7
    Core 8
    Core 9
    Core 11
    Core 12
    Core 13
    Core 14
    Core 15
    Core 18
    Core 20
    Core 22
    Core 23
    Core 24
    Core 26
    Core 27
    Core 28
    Core 29
    Core 30
    Core 31
    Core 33
    Core 34
    Core 35
    Core 36
    Core 38
    Core 39

i.e. it's not just a different permutation of the expected core numbers, we end up with gaps (e.g. the nonexistence of core 10), and core numbers well in excess of the number of cores the processor really "has" (e.g. number 39) -- all of which seems like a rather confusing thing to see in your BMC's sensor readings.


Sure, but what do you see with /proc/cpuinfo and with coretemp
on the host ? It might be even more confusing if the core numbers
reported by the peci driver don't match the core numbers provided
by other tools.


The host sees them numbered as the usual 0-31 you'd generally expect, and assigned to those cores in the same increasing order -- hence the patch bringing the two into alignment with each other. Currently only cores 0 through 9 match up between the two, and the rest are off by somewhere between one and eight.
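
For illustration, here's a rough Python sketch of the renumbering. It is not the driver code: it assumes the enabled cores are available as a bitmask (built here from the label list above), and the names PECI_CORES and contiguous_labels are made up for this example. It walks the set bits in increasing order and hands out labels from a separate counter, as the patch description proposes.

```python
# PECI core indices, copied from the tempX_label listing earlier in this mail.
PECI_CORES = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 18,
    20, 22, 23, 24, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 38, 39,
]


def contiguous_labels(core_mask):
    """Map each set bit (PECI core index) to a contiguous label.

    Labels run from 0 through numcores-1 and are assigned in
    increasing order of PECI index, using a separate counter.
    """
    labels = {}
    counter = 0
    bit = 0
    while core_mask >> bit:          # stop once no higher bits remain
        if (core_mask >> bit) & 1:   # this core is enabled
            labels[bit] = counter
            counter += 1
        bit += 1
    return labels


# Hypothetical bitmask of enabled cores, one bit per PECI core index.
core_mask = sum(1 << c for c in PECI_CORES)
labels = contiguous_labels(core_mask)

for peci, host in labels.items():
    print(f"PECI core {peci:2d} -> Core {host:2d} (offset {peci - host})")
```

On the list above this gives labels 0 through 31, with the fused-off indices (10, 16, 17 and so on) simply never appearing.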


Zev