[RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC

From: Dave Hansen
Date: Mon Nov 06 2017 - 17:15:10 EST



From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>

Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
divided into two "slices", each containing half the cores, half the LLC,
and one memory controller. Each slice is enumerated to Linux as a NUMA
node. This is similar to how the cores and LLC were arranged for the
Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This
means that the portion of the LLC *available* to a CPU depends on the
data being accessed:

 Remote socket: entire package LLC is shared
 Local socket->local slice: data goes into local slice LLC
 Local socket->remote slice: data goes into remote-slice LLC. Slightly
                             higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One
of the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.
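
As an illustration of how that CPUID field turns into an LLC id, here
is a minimal userspace sketch. It assumes the field in question is
bits 25:14 of EAX from CPUID leaf 4 (stored minus one) and that the LLC
id is the APIC ID with the low "sharing" bits masked off. The helper
names are hypothetical and are not the kernel's own functions.

```c
#include <stdint.h>

/*
 * Assumption: bits 25:14 of CPUID.(EAX=4):EAX hold "maximum number of
 * addressable IDs for logical processors sharing this cache", minus one.
 */
static unsigned int threads_sharing_cache(uint32_t eax)
{
	return ((eax >> 14) & 0xfff) + 1;
}

/* Smallest order such that (1 << order) >= count. */
static unsigned int count_order(unsigned int count)
{
	unsigned int order = 0;

	while ((1u << order) < count)
		order++;
	return order;
}

/*
 * Hypothetical helper: CPUs whose APIC IDs differ only in the low
 * "sharing" bits end up with the same LLC id.
 */
static uint32_t llc_id_from_apicid(uint32_t apicid, uint32_t eax)
{
	unsigned int shift = count_order(threads_sharing_cache(eax));

	return apicid & ~((1u << shift) - 1);
}
```

With a field value of 63 (64 threads sharing), APIC IDs 0x00-0x3f all
map to LLC id 0, no matter which NUMA node each CPU belongs to. That is
the situation SNC creates.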

But, the CPUID for the SNC configuration discussed above enumerates
the LLC as being shared by the entire package. This is not 100%
precise because the entire cache is not usable by all accesses. But,
it *is* the way the hardware enumerates itself, and this is not likely
to change.

This breaks the topology_sane() check in the smpboot.c code because
this topology is considered not-sane. To fix this, add a
model-specific check to never call topology_sane() for these systems.
Also, just like "Cluster-on-Die", throw out the "coregroup"
sched_domain_topology_level and use NUMA information from the SRAT
alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package
level, which are also NUMA boundaries.
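
The decision order that match_llc() ends up with can be modeled outside
the kernel roughly as follows. This is only a sketch: the struct, the
MODEL_* constants and the model_* functions are hypothetical stand-ins
for cpuinfo_x86, BAD_APICID, INTEL_FAM6_SKYLAKE_X and the real helpers,
and the sanity check is reduced to a same-node comparison without the
warning.

```c
#include <stdbool.h>

#define MODEL_BAD_LLC_ID	(-1)	/* stand-in for BAD_APICID */
#define MODEL_SKYLAKE_X		0x55	/* value of INTEL_FAM6_SKYLAKE_X */

/* Hypothetical, flattened view of the per-CPU data the check needs. */
struct model_cpu {
	int llc_id;
	int node;
	int x86_model;
};

/* Models x86_has_numa_in_package. */
static bool model_numa_in_package;

/* Simplified: the real check also warns when the nodes differ. */
static bool model_topology_sane(const struct model_cpu *c,
				const struct model_cpu *o)
{
	return c->node == o->node;
}

static bool model_match_llc(const struct model_cpu *c,
			    const struct model_cpu *o)
{
	/* No valid LLC id: never a match. */
	if (c->llc_id == MODEL_BAD_LLC_ID)
		return false;

	/* Different LLC ids: never a match. */
	if (c->llc_id != o->llc_id)
		return false;

	/* Same LLC, different node, known model: match, use NUMA. */
	if (c->node != o->node && c->x86_model == MODEL_SKYLAKE_X) {
		model_numa_in_package = true;
		return true;
	}

	/* Everything else still goes through the sanity check. */
	return model_topology_sane(c, o);
}
```

Two CPUs on different nodes with the same LLC id match only when the
model is Skylake Server; on any other model they still fall through to
the sanity check and are rejected.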

This patch eliminates a warning that looks like this:

sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Luck, Tony <tony.luck@xxxxxxxxx>
Cc: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Igor Mammedov <imammedo@xxxxxxxxxx>
Cc: Prarit Bhargava <prarit@xxxxxxxxxx>
Cc: Toshi Kani <toshi.kani@xxxxxx>
Cc: brice.goglin@xxxxxxxxx
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
---

b/arch/x86/kernel/smpboot.c | 51 ++++++++++++++++++++++++++++++++++----------
1 file changed, 40 insertions(+), 11 deletions(-)

diff -puN arch/x86/kernel/smpboot.c~x86-numa-nodes-share-llc arch/x86/kernel/smpboot.c
--- a/arch/x86/kernel/smpboot.c~x86-numa-nodes-share-llc 2017-11-06 13:29:49.319087764 -0800
+++ b/arch/x86/kernel/smpboot.c 2017-11-06 13:45:12.902085460 -0800
@@ -77,6 +77,7 @@
#include <asm/i8259.h>
#include <asm/realmode.h>
#include <asm/misc.h>
+#include <asm/intel-family.h>

/* Number of siblings per CPU package */
int smp_num_siblings = 1;
@@ -457,15 +458,50 @@ static bool match_smt(struct cpuinfo_x86
return false;
}

+/*
+ * Set if a package/die has multiple NUMA nodes inside.
+ * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
+ * Sub-NUMA Clustering have this.
+ */
+static bool x86_has_numa_in_package;
+
static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
int cpu1 = c->cpu_index, cpu2 = o->cpu_index;

- if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
- per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
- return topology_sane(c, o, "llc");
+ /* Do not match if we do not have a valid APICID for cpu: */
+ if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
+ return false;
+
+ /* Do not match if LLC id does not match: */
+ if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
+ return false;

- return false;
+ /*
+ * Some Intel CPUs enumerate an LLC that is shared by
+ * multiple NUMA nodes. The LLC on these systems is
+ * shared for off-package data access but private to the
+ * NUMA node (half of the package) for on-package access.
+ *
+ * CPUID can only enumerate the cache as being shared *or*
+ * unshared, but not this particular configuration. The
+ * CPU in this case enumerates the cache to be shared
+ * across the entire package (spanning both NUMA nodes).
+ */
+ if (!topology_same_node(c, o) &&
+ (c->x86_model == INTEL_FAM6_SKYLAKE_X)) {
+ /* Use NUMA instead of coregroups for scheduling: */
+ x86_has_numa_in_package = true;
+
+ /*
+ * Now, tell the truth, that the LLC matches. But,
+ * note that throwing away coregroups for
+ * scheduling means this will have no actual effect.
+ */
+ return true;
+ }
+
+ return topology_sane(c, o, "llc");
}

/*
@@ -521,12 +557,6 @@ static struct sched_domain_topology_leve
{ NULL, },
};

-/*
- * Set if a package/die has multiple NUMA nodes inside.
- * AMD Magny-Cours and Intel Cluster-on-Die have this.
- */
-static bool x86_has_numa_in_package;
-
void set_cpu_sibling_map(int cpu)
{
bool has_smt = smp_num_siblings > 1;
@@ -553,7 +583,6 @@ void set_cpu_sibling_map(int cpu)

if ((i == cpu) || (has_mp && match_llc(c, o)))
link_mask(cpu_llc_shared_mask, cpu, i);
-
}

/*
_