Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain
From: Peter Zijlstra
Date: Thu Aug 10 2017 - 12:42:00 EST
On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
> On AMD Family17h-based (EPYC) systems, a NUMA node can contain
> up to 8 cores (16 threads) with the following topology.
>
>    ----------------------------
>    C0 | T0 T1 |    ||    | T0 T1 | C4
>       --------|    ||    |--------
>    C1 | T0 T1 | L3 || L3 | T0 T1 | C5
>       --------|    ||    |--------
>    C2 | T0 T1 | #0 || #1 | T0 T1 | C6
>       --------|    ||    |--------
>    C3 | T0 T1 |    ||    | T0 T1 | C7
>    ----------------------------
>
> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
> contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
> With a full system configuration, the current scheduler creates 4 sched
> domains:
>
> domain0 SMT (spans a core)
> domain1 MC (spans a last-level cache)
Right, so traditionally we'd have the DIE level do that, but because
x86_has_numa_in_package is set we don't generate that level, right?
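IIRC that's this bit in arch/x86/kernel/smpboot.c -- sketching from
memory, so the exact code may differ slightly; the point is that the
NUMA-in-package topology table simply has no DIE entry:

	/*
	 * Rough sketch: SMT and MC only; the NUMA levels appended by
	 * sched_init_numa() are expected to cover everything above MC.
	 */
	static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		{ NULL, },
	};

	/* ... and later, once the boot topology is known: */
	if (x86_has_numa_in_package)
		set_sched_topology(x86_numa_in_package_topology);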
> domain2 NUMA (spans a socket: 4 nodes)
> domain3 NUMA (spans the system: 8 nodes)
>
> Note that there is no domain to represent cpus spanning a NUMA node.
> With this hierarchy of sched domains, the scheduler does not balance
> properly in the following cases:
>
> Case1:
> When running 8 tasks, a properly balanced system should
> schedule one task per NUMA node. This is not the case with
> the current scheduler.
>
> Case2:
> When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
> a properly balanced system should schedule 8 threads on 8 cpus
> (e.g. T0 of C0-C7). However, the current scheduler schedules
> some threads on the same cpu while other cpus stay idle.
Sure.. could you amend with a few actual performance numbers?
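Something like the below, run under 'taskset -c 0-7', should be enough
to show it (hypothetical spin8.c, purely for illustration; 8 CPU-bound
pthreads, built with gcc -O2 -pthread spin8.c -o spin8):

	#include <pthread.h>

	/* Each thread just burns cpu forever. */
	static void *spin(void *arg)
	{
		(void)arg;
		for (;;)
			;
		return NULL;
	}

	int main(void)
	{
		pthread_t t[8];
		int i;

		for (i = 0; i < 8; i++)
			pthread_create(&t[i], NULL, spin, NULL);

		/* Threads never exit; block here to keep them running. */
		pthread_join(t[0], NULL);
		return 0;
	}

Watching where the threads land (e.g. 'ps -eLo tid,psr,comm') should
then show the doubled-up cpus.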
> Introduce a NUMA identity node sched domain, based on how the
> SRAT/SLIT tables define a NUMA node. This results in the following
> hierarchy of sched domains on the same system described above.
>
> domain0 SMT (spans a core)
> domain1 MC (spans a last-level cache)
> domain2 NUMA_IDEN (spans a NUMA node)
Hate that name though..
> domain3 NUMA (spans a socket: 4 nodes)
> domain4 NUMA (spans the system: 8 nodes)
>
> This fixes the improper load balancing cases mentioned above.
>
> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@xxxxxxx>
> ---
> kernel/sched/topology.c | 24 +++++++++++++++++++++---
> 1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895ae..c57df98 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1335,6 +1335,10 @@ void sched_init_numa(void)
> if (!sched_domains_numa_distance)
> return;
>
> + /* Includes NUMA identity node at level 0. */
> + sched_domains_numa_distance[level++] = curr_distance;
> + sched_domains_numa_levels = level;
> +
> /*
> * O(nr_nodes^2) deduplicating selection sort -- in order to find the
> * unique distances in the node_distance() table.
> @@ -1382,8 +1386,7 @@ void sched_init_numa(void)
> return;
>
> /*
> - * 'level' contains the number of unique distances, excluding the
> - * identity distance node_distance(i,i).
> + * 'level' contains the number of unique distances
> *
> * The sched_domains_numa_distance[] array includes the actual distance
> * numbers.
> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
> tl[i] = sched_domain_topology[i];
>
> /*
> + * Ignore the NUMA identity level if it has the same cpumask
> + * as the previous level. This is the case for:
> + * - Systems where the last-level-cache (MC) sched domain spans a NUMA node.
> + * - Systems where the DIE sched domain spans a NUMA node.
> + *
> + * Assume all NUMA nodes are identical, so only check node 0.
> + */
> + if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
> + tl[i++] = (struct sched_domain_topology_level){
> + .mask = sd_numa_mask,
> + .numa_level = 0,
> + SD_INIT_NAME(NUMA_IDEN)
Shall we make that:

	SD_INIT_NAME(NODE),

instead?
> + };
This misses a set of '{}'. While C doesn't require it, our coding style
warrants braces around any multi-line statement.
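I.e., with both of the above fixed it would look like:

	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
		tl[i++] = (struct sched_domain_topology_level){
			.mask = sd_numa_mask,
			.numa_level = 0,
			SD_INIT_NAME(NODE)
		};
	}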
So what you've forgotten to mention is that for those systems where
LLC == NODE this now-superfluous level gets removed by the degenerate
code. Have you verified that it does the right thing?
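For reference, the folding is done by sd_degenerate() and
sd_parent_degenerate() in kernel/sched/topology.c, called from
cpu_attach_domain(). Roughly (paraphrased from memory, not the
verbatim source):

	static int sd_degenerate(struct sched_domain *sd)
	{
		/* A single-CPU span is always redundant. */
		if (cpumask_weight(sched_domain_span(sd)) == 1)
			return 1;

		/* These flags only make sense with at least two groups: */
		if (sd->flags & (SD_BALANCE_NEWIDLE | SD_BALANCE_FORK |
				 SD_BALANCE_EXEC | SD_SHARE_PKG_RESOURCES)) {
			if (sd->groups != sd->groups->next)
				return 0;
		}

		/* These flags don't need groups at all: */
		if (sd->flags & SD_WAKE_AFFINE)
			return 0;

		return 1;
	}

The LLC == NODE case is the sd_parent_degenerate() one: the NODE parent
spans exactly the same cpus as the MC child and adds no flags the child
doesn't already have, so it should get folded. Would be good to see the
resulting domain trees on such a machine to confirm.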
> +
> + /*
> * .. and append 'j' levels of NUMA goodness.
> */
> - for (j = 0; j < level; i++, j++) {
> + for (j = 1; j < level; i++, j++) {
> tl[i] = (struct sched_domain_topology_level){
> .mask = sd_numa_mask,
> .sd_flags = cpu_numa_flags,
> --
> 2.7.4
>