Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share an LLC

From: Peter Zijlstra
Date: Wed Nov 08 2017 - 04:31:31 EST

On Tue, Nov 07, 2017 at 08:22:19AM -0800, Dave Hansen wrote:
> On 11/07/2017 12:30 AM, Peter Zijlstra wrote:
> > On Mon, Nov 06, 2017 at 02:15:00PM -0800, Dave Hansen wrote:
> >
> >> But, the CPUID for the SNC configuration discussed above enumerates
> >> the LLC as being shared by the entire package. This is not 100%
> >> precise because the entire cache is not usable by all accesses. But,
> >> it *is* the way the hardware enumerates itself, and this is not likely
> >> to change.
> >
> > So CPUID and SRAT will remain inconsistent; even in future products?
> > That would absolutely blow chunks.
>
> It certainly isn't ideal as it stands. If it was changed, what would it
> be changed to? You can not even represent the current L3 topology in
> CPUID, at least not precisely.
>
> I've been arguing we should optimize the CPUID information for
> performance. Right now, it's suboptimal for folks doing NUMA-local
> allocations, and I think that's precisely the group of folks that needs
> precise information. I'm trying to get it changed going forward.

So this SNC situation is indeed very intricate and cannot be accurately
represented in CPUID. In fact it's decidedly complex, with matching
complex performance characteristics (I suspect).

People doing NUMA-local stuff care about performance (otherwise they'd
not bother dealing with the NUMA stuff to begin with); however there are
plenty of people mostly ignoring small NUMA because the NUMA factor is
fairly low these days.

And SNC makes it even smaller; it effectively puts a cache in between
the two on-die nodes; not entirely unlike the s390 BOOK domain. Which
makes ignoring NUMA even more tempting.

What does this topology approach do for those workloads?

> > If that is the case, we'd best use a fake feature like
> > X86_BUG_TOPOLOGY_BROKEN and use that instead of an ever growing list of
> > models in this code.
>
> FWIW, I don't consider the current situation broken. Nobody ever
> promised the kernel that a NUMA node would never happen inside a socket,
> or inside a cache boundary enumerated in CPUID.

It's the last that I really find dodgy...

> The assumptions the kernel made were sane, but the CPU's description of
> itself, *and* the BIOS-provided information are also sane. But, the
> world changed, some of those assumptions turned out to be wrong, and
> somebody needs to adjust.

The thing is; this patch effectively says CPUID *is* wrong. It says we
only consider NUMA local cache slices, since that is all that is
available to the local cores.

So what does it mean for cores to share a cache? What does CPUID
describe?

Should we not look at it as a node local L3 with a socket wide L3.5
(I'm not calling it L4 to avoid confusion with the Crystal Well iGPU
stuff) which happens to be tightly coupled? Your statement that the
local slice is faster seems to support that view.

In that case CPUID really is wrong; L3 should be node local. And the
L3.5 should be represented in the NUMA topology (SLIT) as a lower factor
between the two nodes.

Is this closeness between the nodes appropriately described by current
SLIT tables? Can I get: cat /sys/devices/system/node/node*/distance
from a machine that has this enabled?
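
To make concrete what that output would need to show, here is a minimal
user-space sketch in C. Each nodeN/distance file holds one row of
space-separated distances. The helper names parse_distance_row() and
sibling_is_closer() are hypothetical, and the row "10 11 21 21" used in
the usage note is a made-up illustration, not data from an SNC machine.

```c
#include <stdlib.h>

/*
 * Parse one row of /sys/devices/system/node/nodeN/distance,
 * e.g. "10 11 21 21". Returns the number of distances stored in
 * out[] (at most max).
 */
static int parse_distance_row(const char *line, int *out, int max)
{
	int n = 0;
	char *end;

	while (n < max) {
		long v = strtol(line, &end, 10);

		if (end == line)	/* no further number on this line */
			break;
		out[n++] = (int)v;
		line = end;
	}
	return n;
}

/*
 * Hypothetical check for the L3/L3.5 view: in the row belonging to
 * node 'self', is the on-die sibling node farther than local but
 * strictly closer than a node in another package?
 */
static int sibling_is_closer(const int *row, int n,
			     int self, int sibling, int remote)
{
	if (self < 0 || sibling < 0 || remote < 0)
		return 0;
	if (self >= n || sibling >= n || remote >= n)
		return 0;
	return row[self] < row[sibling] && row[sibling] < row[remote];
}
```

With the illustrative row "10 11 21 21" for node 0, the call
sibling_is_closer(row, 4, 0, 1, 2) returns 1. A table that reports the
same distance for every non-local node would return 0, meaning SLIT
does not describe the closeness at all.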

> >> + /* Use NUMA instead of coregroups for scheduling: */
> >> + x86_has_numa_in_package = true;
> >> +
> >> + /*
> >> + * Now, tell the truth, that the LLC matches. But,
> >> + * note that throwing away coregroups for
> >> + * scheduling means this will have no actual effect.
> >> + */
> >> + return true;
> >
> > What are the ramifications here? Is anybody else using that cpumask
> > outside of the scheduler topology setup?
>
> I looked for it and didn't see anything else. I'll double check that
> nothing has popped up since I hacked this together.

Let's put it another way; is there a sane use-case for the multi-node
spanning coregroups thing? It seems inconsistent at best; with the
L3/L3.5 view these cores do not in fact share a cache (although their
caches are 'close').

So I would suggest making that return false and be consistent; esp. if
there are no other users, this doesn't 'cost' anything.
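
A rough user-space model of that suggestion, in C. This is only a
sketch and not the actual kernel code: struct cpu_topo and
llc_match_sketch() are hypothetical names, and only the flag
x86_has_numa_in_package is taken from the quoted hunk.

```c
#include <stdbool.h>

/* Hypothetical per-CPU description; field names are illustrative. */
struct cpu_topo {
	int llc_id;	/* LLC identifier as enumerated by CPUID */
	int node;	/* NUMA node as reported by the firmware */
};

/* Stand-in for the flag of the same name in the quoted patch. */
static bool x86_has_numa_in_package;

/*
 * Sketch: CPUs on different nodes are never reported as sharing an
 * LLC, even when CPUID enumerates a single package-wide LLC.
 */
static bool llc_match_sketch(const struct cpu_topo *a,
			     const struct cpu_topo *b)
{
	if (a->llc_id != b->llc_id)
		return false;

	if (a->node != b->node) {
		/* Use NUMA instead of coregroups for scheduling. */
		x86_has_numa_in_package = true;
		/* Consistent with the L3/L3.5 view: no shared L3. */
		return false;
	}

	return true;
}
```

The only difference from the quoted hunk is the return value in the
cross-node case, so the coregroup mask never spans nodes.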
