Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of socket

From: Borislav Petkov
Date: Wed Jun 28 2017 - 05:13:41 EST


+ Matt and Mel.

On Wed, Jun 28, 2017 at 03:26:10AM +0700, Suravee Suthikulpanit wrote:
> So, from the definition above, we would like all those 16 threads to be in
> the same sched-domain, where threads from C0,1,2,3 are in the same
> sched-group, and threads in C4,5,6,7 are in another sched-group.

Figures, you want to have a sched group per L3.

> > Now that thing has a memory controller attached to it, correct?
>
> Yes
>
> > If so, why is this thing not a logical NUMA node, as described in
> > SRAT/SLIT?
>
> Yes, this thing is a logical NUMA node and represented correctly in the SRAT/SLIT.
>
> > Now, SRAT should contain the assignment which core belongs to which
> > node. Why is that not sufficient?
>
> Yes, SRAT provides cpu-to-node mapping, which is sufficient to tell
> scheduler what are the cpus within a NUMA node.
>
> However, looking at the current sched-domain below. Notice that there is no
> sched-domain with 16 threads to represent a NUMA node:
>
> cpu0
> domain0 00000000,00000001,00000000,00000001 (SMT)
> domain1 00000000,0000000f,00000000,0000000f (MC)
> domain2 00000000,ffffffff,00000000,ffffffff (NUMA)
> domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA)
>
> sched-domain2 (which represents a sched-domain containing all cpus within a
> socket) would have 8 sched-groups (based on the cpumasks from domain1).
> According to the documentation snippet above regarding balancing within a
> sched-domain, scheduler will try to do (NUMA) load-balance between 8 groups
> (spanning 4 NUMA node). Here, IINM, it would be more beneficial if the
> scheduler would try to load balance between the two groups within the same
> NUMA node first before, going across NUMA node in order to minimize memory
> latency. This would require another sched-domain between domain 1 and 2,
> which represent all 16 threads within a NUMA node (i.e. die sched-domain),
> this would allow scheduler to load balance within the NUMA node first,
> before going across NUMA node.
>
> However, since the current code decides that x86_has_numa_in_package is
> true, it omits the die sched-domain. In order to avoid this, we are
> proposing to represent cpuinfo_x86.phys_proc_id using NUMA node ID (i.e. die
> ID). And this is the main point of the patch series.

Ok, so here's what I'm reading from all this: a 16-thread package *is*
represented as a logical NUMA node properly by SRAT/SLIT, after all.
However, sched_init_numa() doesn't parse it properly or something else
along that path keeps the scheduler from seeing the 4 NUMA nodes on the
physical socket.

Because we do use the SLIT table to build the NUMA domains. So
considering that 16T package really is represented properly as a logical
NUMA node, why doesn't it get detected as such?

This is what needs asnwering first and not some forceful fitting the
topology into a DIE domain.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.