Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC

From: Dave Hansen
Date: Wed Nov 08 2017 - 19:01:08 EST


On 11/08/2017 01:31 AM, Peter Zijlstra wrote:
> And SNC makes it even smaller; it effectively puts a cache in between
> the two on-die nodes; not entirely unlike the s390 BOOK domain. Which
> makes ignoring NUMA even more tempting.
>
> What does this topology approach do for those workloads?

What does this L3 topology do for workloads ignoring NUMA?

Let's assume an app that is entirely NUMA-unaware and that accesses a
large amount of memory uniformly across the entire system. Say we have
a 2-socket system which now shows up as having 4 NUMA nodes (one node
per slice, two slices per socket, two sockets), with 20MB of L3 per
socket, so 10MB per slice.

- 1/4 of the memory accesses will be local to the slice and will have
  access to 10MB of L3.
- 1/4 of the memory accesses will be to the *other* slice and will have
  access to 10MB of L3 (non-conflicting with the previous 10MB). This
  access is marginally slower than the access to the local slice.
- 1/2 of the memory accesses will be cross-socket and will have access
  to 20MB of L3 (both slices' L3s).

That's all OK. Without this halved-L3 configuration (what the previous
generation called Cluster-on-Die), it looked like this (a toy sketch
restating both breakdowns follows the list):

- 1/2 of the memory accesses will be local to the socket and will have
  access to 20MB of L3.
- 1/2 of the memory accesses will be cross-socket and will have access
  to 20MB of L3 (both slices' L3s).
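Purely to restate the arithmetic, here's a toy sketch; the 20MB-per-socket
and two-slices-per-socket numbers are just the example values from above,
nothing here is probed from real hardware:

/* Toy model of the two breakdowns above; assumes uniform accesses
 * across a 2-socket system with 20MB of L3 per socket, 2 slices/socket. */
#include <stdio.h>

#define L3_PER_SOCKET_MB	20
#define SLICES_PER_SOCKET	2

int main(void)
{
	int slice_mb = L3_PER_SOCKET_MB / SLICES_PER_SOCKET;

	printf("SNC on (4 nodes):\n");
	printf("  1/4 slice-local accesses:  %2d MB of L3\n", slice_mb);
	printf("  1/4 other-slice accesses:  %2d MB of L3 (non-conflicting)\n",
	       slice_mb);
	printf("  1/2 cross-socket accesses: %2d MB of L3\n",
	       L3_PER_SOCKET_MB);

	printf("SNC off (2 nodes):\n");
	printf("  1/2 socket-local accesses: %2d MB of L3\n",
	       L3_PER_SOCKET_MB);
	printf("  1/2 cross-socket accesses: %2d MB of L3\n",
	       L3_PER_SOCKET_MB);

	/* Either way the on-socket half of the accesses can spread over
	 * the full 20MB; SNC just splits it into two 10MB halves with
	 * slightly different hit latencies. */
	return 0;
}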

I'd argue that those two end up looking pretty much the same to an app.
The only difference is that the slice-local and slice-remote cache hits
have slightly different access latencies. I don't think it's enough to
notice.

The place where this is not optimal is when an app does NUMA-local
accesses, sees that it has 20MB of L3 (via CPUID), and expects to
*get* 20MB of L3; a slice-local working set is only backed by the
local slice's 10MB.
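
For illustration only (a hedged sketch, not part of the patch), this is
roughly what such an app "sees": CPUID leaf 0x4 reports the full
per-socket L3, with no hint that under SNC a NUMA-local working set is
only backed by one slice:

/* Print the L3 size as an application would discover it via CPUID
 * leaf 0x4 (deterministic cache parameters).  x86 only; build with
 * gcc -o l3size l3size.c */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	unsigned int idx;

	for (idx = 0; ; idx++) {
		if (!__get_cpuid_count(4, idx, &eax, &ebx, &ecx, &edx))
			break;

		unsigned int type  = eax & 0x1f;	/* 0 == no more caches */
		unsigned int level = (eax >> 5) & 0x7;

		if (!type)
			break;
		if (level != 3)
			continue;

		/* size = ways * partitions * line_size * sets */
		unsigned long ways  = ((ebx >> 22) & 0x3ff) + 1;
		unsigned long parts = ((ebx >> 12) & 0x3ff) + 1;
		unsigned long line  = (ebx & 0xfff) + 1;
		unsigned long sets  = (unsigned long)ecx + 1;

		printf("L3 reported by CPUID: %lu KB\n",
		       ways * parts * line * sets / 1024);
	}
	return 0;
}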