RE: [PATCH] EDAC/{i10nm,skx,skx_common}: Support multiple clumps
From: Luck, Tony
Date: Fri Dec 06 2024 - 16:26:24 EST
> So, we're back to the original issue on systems with multiple UPI/QPI domains
> when NUMA is disabled.
>
> Systems with multiple UPI/QPI domains can't use source IDs to map devices
> to sockets. skx_get_src_id() will successfully read the source ID from PCI
> configuration space registers but it might not map to the correct socket because
> each UPI/QPI domain has identical repeating source IDs.
>
> For example, 8 sockets with 2 UPI/QPI domains:
>
> Socket 0 -> Source ID 0
> Socket 1 -> Source ID 1
> Socket 2 -> Source ID 2
> Socket 3 -> Source ID 3
> Socket 4 -> Source ID 0
> Socket 5 -> Source ID 1
> Socket 6 -> Source ID 2
> Socket 7 -> Source ID 3
>
> EDAC will successfully load, but it will not find the corresponding device
> for errors on sockets 4 through 7 (for example, see skx_common.c:178).
>
> Looking at my original patch, EDAC will not load when a system has multiple
> UPI/QPI domains and NUMA is disabled. We fail early with "Failed to get
> package ID from NUMA information" instead of later when an error occurs.
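
For illustration, here is a minimal user-space sketch of the lookup failure
described above. It is not the driver code: struct toy_dev, toy_init() and
toy_find() are made-up names, and matching the reported socket number against
the source ID alone is a simplifying assumption about how the lookup behaves.

```c
#include <stddef.h>

#define NUM_SOCKETS       8
#define SOCKETS_PER_CLUMP 4

/* Hypothetical stand-in for a per-socket EDAC device; not the driver's struct. */
struct toy_dev {
	int socket;	/* physical socket the device lives on */
	int src_id;	/* source ID as read from PCI config space */
};

/* Fill the table so that source IDs repeat in each clump, as in the example. */
static void toy_init(struct toy_dev devs[NUM_SOCKETS])
{
	for (int s = 0; s < NUM_SOCKETS; s++) {
		devs[s].socket = s;
		devs[s].src_id = s % SOCKETS_PER_CLUMP;
	}
}

/* Assumed lookup: find the device for an error on `socket` by src_id only. */
static const struct toy_dev *toy_find(const struct toy_dev devs[NUM_SOCKETS],
				      int socket)
{
	for (int i = 0; i < NUM_SOCKETS; i++)
		if (devs[i].src_id == socket)
			return &devs[i];
	return NULL;	/* no device carries a src_id of 4..7 */
}
```

With this table, toy_find() returns a device for sockets 0 through 3 and NULL
for sockets 4 through 7, because no entry ever carries a source ID above 3.
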
Looks like there are four, mostly orthogonal, configuration possibilities:
1) System: Clumps vs. no-clumps
2) BIOS: Option to pick UMA vs. NUMA description to OS
3) Kernel configuration: CONFIG_NUMA=y vs. "not set"
4) Kernel x86 boot param: numa=off (or numa=noacpi, numa=nohmat) vs. no option
Maybe not all permutations make sense? '2' and '4' may be effectively the same?
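
A rough way to enumerate the 16 permutations, as a sketch only. The names
(struct toy_cfg and the toy_* helpers) are hypothetical, option 4 is modelled
as numa=off only, and the two predicates encode my assumptions: that Linux has
usable NUMA information only when the BIOS, the kernel config and the boot
parameters all allow it, and that the troublesome case is clumps without it.

```c
#include <stdbool.h>

/* The four knobs from the list above, each modelled as a boolean. */
struct toy_cfg {
	bool clumps;		/* 1) system has multiple UPI/QPI domains */
	bool bios_numa;		/* 2) BIOS describes the system as NUMA */
	bool config_numa;	/* 3) kernel built with CONFIG_NUMA=y */
	bool boot_numa_off;	/* 4) numa=off on the kernel command line */
};

/* Decode bits 0..3 of `n` into one of the 16 permutations. */
static struct toy_cfg toy_cfg_from_index(unsigned int n)
{
	struct toy_cfg c = {
		.clumps        = (n & 1u) != 0,
		.bios_numa     = (n & 2u) != 0,
		.config_numa   = (n & 4u) != 0,
		.boot_numa_off = (n & 8u) != 0,
	};
	return c;
}

/* Assumption: NUMA info is usable only if all three knobs allow it. */
static bool toy_numa_visible(struct toy_cfg c)
{
	return c.bios_numa && c.config_numa && !c.boot_numa_off;
}

/* Assumption: the case needing a fix is clumps without NUMA info. */
static bool toy_needs_fallback(struct toy_cfg c)
{
	return c.clumps && !toy_numa_visible(c);
}

/* Count how many of the 16 permutations fall into that case. */
static unsigned int toy_count_fallback_cases(void)
{
	unsigned int count = 0;

	for (unsigned int n = 0; n < 16; n++)
		if (toy_needs_fallback(toy_cfg_from_index(n)))
			count++;
	return count;
}
```

Under those assumptions 7 of the 16 permutations have clumps but no NUMA
information, and options 2, 3 and 4 are interchangeable as far as this
predicate is concerned.
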
Can this be solved for systems that have clumps but where Linux is ignoring
NUMA-ness?
-Tony