Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of socket
From: Suravee Suthikulpanit
Date: Tue Jun 27 2017 - 09:07:28 EST
Hi Boris,
On 6/27/17 17:48, Borislav Petkov wrote:
On Tue, Jun 27, 2017 at 01:40:52AM -0500, Suravee Suthikulpanit wrote:
According to the Documentation/x86/topology.txt, AMD nomenclature for
package is NUMA node (or die).
You guys keep talking about die so you should add the definition of
"die" to that document so that we're all on the same page. In a separate
patch.
What we are trying point out here is that (NUMA) "node" and "die" are the same
thing in most of AMD processors, not necessary trying to introduce another term
here.
However, this is not the case on AMD family17h multi-die processor
platforms, which can have up to 4 dies per socket as shown in the
following system topology.
So what exactly does that mean? A die is a package on ZN and you can have up to 4 packages on a physical socket?
Yes. 4 packages (or 4 dies, or 4 NUMA nodes) in a socket.
Current logic interpretes package as socket (i.e. phys_proc_id is
socket id), which results in setting x86_has_numa_in_package, and omits
the DIE schedule domain.
Well, this is what we do on MCM Magny-Cours. Btw, here's where you
explain *why* you need the DIE domain.
As I have described in the cover letter, this patch series changes how kernel
derives cpu "package" from package-as-socket to package-as-die in order to fix
following issues on AMD family17h multi-die processor platforms:
* irqbalance fails to allocating IRQs to individual CPU within the die.
* The scheduler fails to load-balance across 8 threads within a die`
(e.g. running 8-thread application w/ taskset -c 0-7 ) with
the DIE schedule domain omitted due to x86_has_numa_in_package.
These issues are fixed when properly intepretes package as DIE.
For MCM Magny-Cours (family15h), we do not have this issue because the MC
sched-domain has the same cpumask as the DIE sched-domain. This is not the same
as for Zen cores (family17h) where the MC sched-domain is the CCX (2 CCX per die)
However, NUMA schedule domains are derived from
SRAT/SLIT, which assumes NUMA node is a die,
Ok, so I don't understand this: SRAT/SLIT should basically say that you
have 4 *logical* NUMA nodes on that *physical* socket. Is that what you
want? Or is not what you want? Because it makes sense to me that BIOS
should describe how the logical grouping of the cores in a node is.
And x86_numa_in_package_topology roughly means that we rely on SRAT/SLIT
to tell us the topology.
SRAT/SLIT tables are doing the right thing, where it describes topology among
*logical* NUMA nodes (dies). So, for the system view below:
System View (with 2 socket) :
--------------------
| -------------|------
| | | |
------------ ------------
| D1 -- D0 | | D7 -- D6 |
| | \/ | | | | \/ | |
SOCKET0 | | /\ | | | | /\ | | SOCKET1
| D2 -- D3 | | D4 -- D5 |
------------ ------------
| | | |
------|------------| |
--------------------
SLIT table is showing
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10
However, SRAT/SLIT does not describe the DIE. So, using
x86_numa_in_packge_topology on multi-die Zen processor will result in missing
the DIE sched-domain for cpus within a die.
So why can't SRAT/SLIT do that just like they did on Magny-Cours MCM
packages?
Again, Magny-Cours MCM DIE sched-domain is the same as MC sched-domain. So, we
can omit the DIE sched-domain. Here is an example of /proc/schedstat
Magney-Cours cpu0
domain0 00000000,00000003 (SMT)
domain1 00000000,000000ff (MC which is the same as DIE)
domain2 00ff00ff,00ffffff (NUMA 1 hop)
domain3 ffffffff,ffffffff (NUMA platform)
Zen cpu0 (package-as-socket)
domain0 00000000,00000001,00000000,00000001 (SMT)
domain1 00000000,0000000f,00000000,0000000f (MC ccx)
domain2 00000000,ffffffff,00000000,ffffffff (NUMA socket)
domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA platform)
Zen cpu0 (package-as-die)
domain0 00000000,00000001,00000000,00000001 (SMT)
domain1 00000000,0000000f,00000000,0000000f (MC ccx)
domain2 00000000,000000ff,00000000,000000ff (DIE)
domain3 00000000,ffffffff,00000000,ffffffff (NUMA socket)
domain4 ffffffff,ffffffff,ffffffff,ffffffff (NUMA platform)
Hope this helps clarify the purpose of this patch series. If no other concerns
going into this direction, I'll split and clean up the patch series per your
initial review comments.
Thanks,
Suravee