Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

From: Suravee Suthikulpanit
Date: Fri Aug 11 2017 - 00:57:25 EST


On 8/10/17 23:41, Peter Zijlstra wrote:
On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
On AMD Family17h-based (EPYC) system, a NUMA node can contain
upto 8 cores (16 threads) with the following topology.

----------------------------
C0 | T0 T1 | || | T0 T1 | C4
--------| || |--------
C1 | T0 T1 | L3 || L3 | T0 T1 | C5
--------| || |--------
C2 | T0 T1 | #0 || #1 | T0 T1 | C6
--------| || |--------
C3 | T0 T1 | || | T0 T1 | C7
----------------------------

Here, there are 2 last-level (L3) caches per NUMA node. A socket can
contain upto 4 NUMA nodes, and a system can support upto 2 sockets.
With full system configuration, current scheduler creates 4 sched
domains:

domain0 SMT (span a core)
domain1 MC (span a last-level-cache)

Right, so traditionally we'd have the DIE level do that, but because
x86_has_numa_in_package we don't generate that, right?

That's correct.


domain2 NUMA (span a socket: 4 nodes)
domain3 NUMA (span a system: 8 nodes)

Note that there is no domain to represent cpus spaning a NUMA node.
With this hierachy of sched domains, the scheduler does not balance
properly in the following cases:

Case1:
When running 8 tasks, a properly balanced system should
schedule a task per NUMA node. This is not the case for
the current scheduler.

Case2:
When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
a properly balanced system should schedule 8 threads on 8 cpus
(e.g. T0 of C0-C7). However, current scheduler would schedule
some threads on the same cpu, while others are idle.

Sure.. could you amend with a few actual performance numbers?

Sure.

[...]
@@ -1445,9 +1448,24 @@ void sched_init_numa(void)
tl[i] = sched_domain_topology[i];

/*
+ * Ignore the NUMA identity level if it has the same cpumask
+ * as previous level. This is the case for:
+ * - System with last-level-cache (MC) sched domain span a NUMA node.
+ * - System with DIE sched domain span a NUMA node.
+ *
+ * Assume all NUMA nodes are identical, so only check node 0.
+ */
+ if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
+ tl[i++] = (struct sched_domain_topology_level){
+ .mask = sd_numa_mask,
+ .numa_level = 0,
+ SD_INIT_NAME(NUMA_IDEN)

Shall we make that:

SD_INIT_NAME(NODE),

instead?

Sounds good.

+ };

This misses a set of '{}'. While C doesn't require it, out coding style
warrants blocks around any multi-line statement.

So what you've forgotten to mention is that for those systems where the
LLC == NODE this now superfluous level gets removed by the degenerate
code. Have you verified that does the right thing?

Let me check with that one and get back.

Thanks,
Suravee