Re: [PATCH v3] x86,sched: allow topologies where NUMA nodes share an LLC
From: Alison Schofield
Date: Fri Mar 30 2018 - 13:33:45 EST
On Wed, Mar 28, 2018 at 05:00:24PM -0700, Alison Schofield wrote:
> From: Alison Schofield <alison.schofield@xxxxxxxxx>
>
> Intel's Skylake Server CPUs have a different LLC topology than previous
> generations. When in Sub-NUMA-Clustering (SNC) mode, the package is
> divided into two "slices", each containing half the cores, half the LLC,
> and one memory controller, and each slice is enumerated to Linux as a
> NUMA node. This is similar to how the cores and LLC were arranged
> for the Cluster-On-Die (CoD) feature.
>
> CoD allowed the same cache line to be present in each half of the LLC.
> But, with SNC, each line is only ever present in *one* slice. This
> means that the portion of the LLC *available* to a CPU depends on the
> data being accessed:
>
> Remote socket: entire package LLC is shared
> Local socket->local slice: data goes into local slice LLC
> Local socket->remote slice: data goes into remote-slice LLC. Slightly
> higher latency than local slice LLC.
>
> The biggest implication from this is that a process accessing all
> NUMA-local memory only sees half the LLC capacity.
>
> The CPU describes its cache hierarchy with the CPUID instruction. One
> of the CPUID leaves enumerates the "logical processors sharing this
> cache". This information is used for scheduling decisions so that tasks
> move more freely between CPUs sharing the cache.
>
> But, the CPUID for the SNC configuration discussed above enumerates
> the LLC as being shared by the entire package. This is not 100%
> precise because the entire cache is not usable by all accesses. But,
> it *is* the way the hardware enumerates itself, and this is not likely
> to change.
>
> This breaks the topology_sane() check in the smpboot.c code because
> this topology is considered not-sane. To fix this, add a vendor and
> model specific check to never call topology_sane() for these systems.
> Also, just like "Cluster-on-Die", we throw out the "coregroup"
> sched_domain_topology_level and use NUMA information from the SRAT
> alone.
>
> This is OK at least on the hardware we are immediately concerned about
> because the LLC sharing happens at both the slice and at the package
> level, which are also NUMA boundaries.
>
> This patch eliminates a warning that looks like this:
>
> sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
>
Let's see if I'm getting a better grasp of this:
My goal here is to suppress that WARNING message from topology_sane().
(We have a customer who is seeing the WARNING and would like it to go away)
The sysfs-exported info for SNC systems is 'not precise': it reports the
entire LLC as available to each CPU. This imprecise data existed before
this patch and exists after it. This is a problem, agreed.
PeterZ:
At first I thought you were saying that this patch itself broke the
sysfs info. I experimented with that and found no differences in sysfs
info before/after the patch and with SNC on/off. That makes me think
you are saying that we should not declare this topology 'allowed' when
the sysfs data is wrong. (i.e., that WARNING serves a purpose)
If you did indeed mean that the patch breaks the sysfs data, please
point me real close! i.e., how does it change the cache-mask as exposed
to userspace?
All:
Here are 3 alternatives:
1) Keep patch code basically the same and improve the comments & commit,
being very explicit about the sysfs info issue.
2) Change the way the LLC-size is reported.
Enumerate "two separate, half-sized LLCs shared only by the slice when
SNC mode is on."
3) Do not export the sysfs info that is wrong, so userspace cannot make
bad decisions based upon it.
Can we do 1) now and then follow with 2) or 3)?
Thanks for all the reviews/comments,
alisons