[PATCH v2] x86,sched: allow topologies where NUMA nodes share an LLC

From: Alison Schofield
Date: Thu Mar 22 2018 - 16:49:02 EST


From: Alison Schofield <alison.schofield@xxxxxxxxx>

Intel's Skylake Server CPUs have a different LLC topology than previous
generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
divided into two "slices", each containing half the cores, half the LLC,
and one memory controller and each slice is enumerated to Linux as a
NUMA node. This is similar to how the cores and LLC were arranged
for the Cluster-On-Die (CoD) feature.

CoD allowed the same cache line to be present in each half of the LLC.
But, with SNC, each line is only ever present in *one* slice. This
means that the portion of the LLC *available* to a CPU depends on the
data being accessed:

Remote socket: entire package LLC is shared
Local socket->local slice: data goes into local slice LLC
Local socket->remote slice: data goes into remote-slice LLC. Slightly
higher latency than local slice LLC.

The biggest implication from this is that a process accessing all
NUMA-local memory only sees half the LLC capacity.

The CPU describes its cache hierarchy with the CPUID instruction. One
of the CPUID leaves enumerates the "logical processors sharing this
cache". This information is used for scheduling decisions so that tasks
move more freely between CPUs sharing the cache.

But, the CPUID for the SNC configuration discussed above enumerates
the LLC as being shared by the entire package. This is not 100%
precise because the entire cache is not usable by all accesses. But,
it *is* the way the hardware enumerates itself, and this is not likely
to change.

This breaks the sane_topology() check in the smpboot.c code because
this topology is considered not-sane. To fix this, add a vendor and
model specifc check to never call topology_sane() for these systems.
Also, just like "Cluster-on-Die" we throw out the "coregroup"
sched_domain_topology_level and use NUMA information from the SRAT
alone.

This is OK at least on the hardware we are immediately concerned about
because the LLC sharing happens at both the slice and at the package
level, which are also NUMA boundaries.

This patch eliminates a warning that looks like this:

sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

Signed-off-by: Alison Schofield <alison.schofield@xxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Luck, Tony <tony.luck@xxxxxxxxx>
Cc: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Cc: Igor Mammedov <imammedo@xxxxxxxxxx>
Cc: Prarit Bhargava <prarit@xxxxxxxxxx>
Cc: brice.goglin@xxxxxxxxx
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
---

Changes in v2:

* Add vendor check (Intel) where we only had a model check (Skylake_X).
Considered the suggestion of adding a new flag here but thought that
to be overkill for this usage.

* Return false, instead of true, from match_llc() per reviewer suggestion.
That also cleaned up a topology broken bug message in sched_domain().

* Updated the comments around the match_llc() return value, continuing to
note that the return value doesn't actually matter because we are throwing
away coregroups for scheduling.

* Changed submitter. I'm following up on this patch originally submitted
by Dave Hansen


arch/x86/kernel/smpboot.c | 52 ++++++++++++++++++++++++++++++++++++++---------
1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ff99e2b..cffd181 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -77,6 +77,7 @@
#include <asm/i8259.h>
#include <asm/misc.h>
#include <asm/qspinlock.h>
+#include <asm/intel-family.h>

/* Number of siblings per CPU package */
int smp_num_siblings = 1;
@@ -390,15 +391,52 @@ static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
return false;
}

+/*
+ * Set if a package/die has multiple NUMA nodes inside.
+ * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
+ * Sub-NUMA Clustering have this.
+ */
+static bool x86_has_numa_in_package;
+
static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
int cpu1 = c->cpu_index, cpu2 = o->cpu_index;

- if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
- per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
- return topology_sane(c, o, "llc");
+ /* Do not match if we do not have a valid APICID for cpu: */
+ if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
+ return false;

- return false;
+ /* Do not match if LLC id does not match: */
+ if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
+ return false;
+
+ /*
+ * Some Intel CPUs enumerate an LLC that is shared by
+ * multiple NUMA nodes. The LLC on these systems is
+ * shared for off-package data access but private to the
+ * NUMA node (half of the package) for on-package access.
+ *
+ * CPUID can only enumerate the cache as being shared *or*
+ * unshared, but not this particular configuration. The
+ * CPU in this case enumerates the cache to be shared
+ * across the entire package (spanning both NUMA nodes).
+ */
+ if (!topology_same_node(c, o) &&
+ (c->x86_vendor == X86_VENDOR_INTEL &&
+ c->x86_model == INTEL_FAM6_SKYLAKE_X)) {
+ /* Use NUMA instead of coregroups for scheduling: */
+ x86_has_numa_in_package = true;
+
+ /*
+ * Return value doesn't actually matter because we
+ * are throwing away coregroups for scheduling anyway.
+ * Return false to bypass topology broken bug messages
+ * and fixups in sched_domain().
+ */
+ return false;
+ }
+
+ return topology_sane(c, o, "llc");
}

/*
@@ -454,12 +492,6 @@ static struct sched_domain_topology_level x86_topology[] = {
{ NULL, },
};

-/*
- * Set if a package/die has multiple NUMA nodes inside.
- * AMD Magny-Cours and Intel Cluster-on-Die have this.
- */
-static bool x86_has_numa_in_package;
-
void set_cpu_sibling_map(int cpu)
{
bool has_smt = smp_num_siblings > 1;
--
2.7.4