Re: [PATCH v2 7/7] x86/resctrl: Determine if Sub-NUMA Cluster is enabled and initialize.

From: Peter Newman
Date: Thu Jun 22 2023 - 10:26:04 EST


Hi Tony,

On Wed, Jun 21, 2023 at 7:40 PM Tony Luck <tony.luck@xxxxxxxxx> wrote:
>
> There isn't a simple hardware enumeration to indicate to software that
> a system is running with Sub-NUMA Cluster enabled.
>
> Compare the number of NUMA nodes with the number of L3 caches to calculate
> the number of Sub-NUMA nodes per L3 cache.
>
> When Sub-NUMA cluster mode is enabled in BIOS setup the RMID counters
> are distributed equally between the SNC nodes within each socket.
>
> E.g. if there are 400 RMID counters, and the system is configured with
> two SNC nodes per socket, then RMID counter 0..199 are used on SNC node
> 0 on the socket, and RMID counter 200..399 on SNC node 1.
>
> A model specific MSR (0xca0) can change the configuration of the RMIDs
> when SNC mode is enabled.
>
> The MSR controls the interpretation of the RMID field in the
> IA32_PQR_ASSOC MSR so that the appropriate hardware counters
> within the SNC node are updated.
>
> Also initialize a per-cpu RMID offset value. Use this
> to calculate the value to write to the IA32_QM_EVTSEL MSR when
> reading RMID event values.
>
> N.B. this works well for well-behaved NUMA applications that access
> memory predominantly from the local memory node. For applications that
> access memory across multiple nodes it may be necessary for the user
> to read counters for all SNC nodes on a socket and add the values to
> get the actual LLC occupancy or memory bandwidth. Perhaps this isn't
> all that different from applications that span across multiple sockets
> in a legacy system.

Unfortunately, I'm not getting results that are as good with the new series.
The main difference seems to be updating the 0xca0 MSR instead of
applying the offset to PQR_ASSOC.
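
For reference, the RMID split described in the commit message can be
sketched as below. This is only an illustration of the arithmetic, assuming
the counters divide equally between the SNC nodes; the function name and
parameters are made up here and are not taken from the patch.

```python
def snc_rmid_range(num_rmids, snc_nodes_per_l3, snc_node):
    """Return the (first, last) hardware RMID counter used by one SNC node.

    Hypothetical helper: assumes the RMID counters are split equally
    between the SNC nodes that share an L3 cache.
    """
    if not 0 <= snc_node < snc_nodes_per_l3:
        raise ValueError("snc_node out of range")
    per_node = num_rmids // snc_nodes_per_l3
    first = snc_node * per_node  # this is the per-node RMID offset
    return first, first + per_node - 1


# The example from the commit message: 400 RMIDs, two SNC nodes per socket.
print(snc_rmid_range(400, 2, 0))  # (0, 199)
print(snc_rmid_range(400, 2, 1))  # (200, 399)
```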

In my test case of generating bandwidth on 20 random CPUs in 20 random
RMIDs, I'm only getting correct counts from CPUs in node 0. Node 1
CPUs are showing counts which are too small, and nodes 2 and 3 are
seeing no bandwidth at all:

(Expected bandwidth is around 30 GB; the value in the first parentheses is the L3 cache id.)

cpu: 134 (0), rmid: 30 (g29): 0 -> 30640791552 (30640791552)
cpu: 138 (0), rmid: 103 (g101): 0 -> 28196962304 (28196962304)

cpu: 35 (0), rmid: 211 (g209): 0 -> 3039232 (3039232)
cpu: 55 (0), rmid: 113 (g111): 0 -> 4874240 (4874240)
cpu: 41 (0), rmid: 83 (g81): 0 -> 2637824 (2637824)
cpu: 42 (0), rmid: 218 (g216): 0 -> 2408448 (2408448)
cpu: 161 (0), rmid: 8 (g7): 0 -> 7856128 (7856128)

cpu: 86 (1), rmid: 171 (g169): 0 -> 0 (0)
cpu: 86 (1), rmid: 121 (g119): 0 -> 0 (0)
cpu: 212 (1), rmid: 163 (g161): 0 -> 0 (0)
cpu: 180 (1), rmid: 129 (g127): 0 -> 0 (0)
cpu: 205 (1), rmid: 186 (g184): 0 -> 0 (0)
cpu: 194 (1), rmid: 160 (g158): 0 -> 0 (0)
cpu: 186 (1), rmid: 196 (g194): 0 -> 0 (0)
cpu: 106 (1), rmid: 93 (g91): 0 -> 0 (0)
cpu: 84 (1), rmid: 168 (g166): 0 -> 0 (0)
cpu: 197 (1), rmid: 104 (g102): 0 -> 0 (0)
cpu: 64 (1), rmid: 103 (g101): 0 -> 0 (0)
cpu: 71 (1), rmid: 81 (g79): 0 -> 0 (0)
cpu: 60 (1), rmid: 221 (g219): 0 -> 0 (0)

Here's the output of `cat /sys/devices/system/node/node*/cpulist` on
this machine for reference:

0-27,112-139
28-55,140-167
56-83,168-195
84-111,196-223
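
In case it helps with matching the CPUs in the results to nodes, here is a
rough sketch of the lookup. The helper names are made up for illustration,
and the cpulist strings are copied from the output above rather than read
from sysfs.

```python
# cpulist strings for node0..node3, copied from the output above.
CPULISTS = ["0-27,112-139", "28-55,140-167", "56-83,168-195", "84-111,196-223"]


def parse_cpulist(text):
    """Expand a cpulist string such as "0-27,112-139" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        lo, _, hi = part.partition("-")
        # A single CPU ("5") has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def cpu_to_node(cpu, cpulists=CPULISTS):
    """Return the index of the NUMA node whose cpulist contains cpu."""
    for node, text in enumerate(cpulists):
        if cpu in parse_cpulist(text):
            return node
    raise ValueError(f"cpu {cpu} not found in any node")


# A few of the CPUs from the results above.
for cpu in (134, 35, 180, 86):
    print(cpu, cpu_to_node(cpu))
```

With this mapping, 134 and 138 land in node 0, 35 through 161 in node 1,
and the remaining CPUs in nodes 2 and 3.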

-Peter