Re: [PATCH] x86, sched: Allow NUMA nodes to share an LLC on Intel platforms

From: Dave Hansen
Date: Wed Feb 10 2021 - 12:42:51 EST


On 2/10/21 12:10 AM, Peter Zijlstra wrote:
> On Tue, Feb 09, 2021 at 11:09:27PM +0000, Luck, Tony wrote:
>>> +#define X86_BUG_NUMA_SHARES_LLC X86_BUG(25) /* CPU may enumerate an LLC shared by multiple NUMA nodes */
>>
>> During internal review I wondered why this is a "BUG" rather than a "FEATURE" bit.
>>
>> Apparently, the suggestion for "BUG" came from earlier community discussions.
>>
>> Historically it may have seemed reasonable to say that a cache cannot span
>> NUMA domains. But with more and more things moving off the motherboard
>> and into the socket, this doesn't seem too weird now.
>
> If you look at the details this SNC LLC span doesn't behave quite right
> either.

Yes, the rules are weird. I came to the conclusion that there's no
precise way to enumerate these rules with the existing CPUID-based cache
enumeration.

I can send you my powerpoint slides. ;)
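
To illustrate why the existing enumeration falls short, here is a minimal
userspace sketch (not kernel code) that decodes the EAX register of one
CPUID leaf 4 subleaf. The struct and function names are hypothetical and
chosen for this example only. The point is that each cache is described by
little more than a level, a type and a single "how many logical CPU IDs may
share this" count, which leaves no room to express per-node rules.

```c
#include <stdint.h>

/* Hypothetical decoded view of EAX from one CPUID leaf 4 subleaf. */
struct cache_leaf {
	unsigned int type;        /* EAX[4:0]: 0 = no more caches, 1 = data, 2 = instruction, 3 = unified */
	unsigned int level;       /* EAX[7:5]: cache level, starting at 1 */
	unsigned int max_sharing; /* EAX[25:14] + 1: max logical CPU IDs sharing this cache */
};

/* Extract the fields above from a raw EAX value. */
static struct cache_leaf decode_cache_leaf(uint32_t eax)
{
	struct cache_leaf c;

	c.type = eax & 0x1f;
	c.level = (eax >> 5) & 0x7;
	c.max_sharing = ((eax >> 14) & 0xfff) + 1;
	return c;
}
```

Nothing in these fields says which NUMA node a sharing CPU belongs to, so
the node relationship has to come from somewhere else.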

> It really isn't a regular cache, but behaves a bit like a mash-up of the
> s390 book caches and a normal LLC.
>
> Did anybody play with adding the book domain to these SNC
> configurations?

Nope. Probably mostly because we don't have a great way of generating it.

For those playing along at home, I think Peter is talking about this:

static struct sched_domain_topology_level s390_topology[] = {
	{ cpu_thread_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
	{ cpu_drawer_mask, SD_INIT_NAME(DRAWER) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

From arch/s390/kernel/topology.c

> Can we detect SNC other than by this quirk?

I'm sure there's _a_ way, but nothing that's architectural. The kernel
has literally been given all the information about the topology that it
needs from the CPU and the firmware. The problem is that that
information resembles garbage that the kernel has been presented with in
the past.

I guess you're saying that it would be nice to have some other bit of
info that the kernel can use to boost its confidence that the
hardware/BIOS are being sane.
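
For anyone following along, the condition being discussed can be sketched
as a toy check: does one LLC's CPU mask overlap more than one NUMA node's
CPU mask? This is only an illustration under stated assumptions, not the
kernel's implementation. The 64-bit mask type and both function names are
made up for the example, and real kernel code would use struct cpumask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy CPU mask: bit n set means CPU n is a member (at most 64 CPUs). */
typedef uint64_t cpumask64;

/* Count the NUMA nodes that have at least one CPU in the given LLC mask. */
static int nodes_spanned_by_llc(cpumask64 llc, const cpumask64 *node_masks,
				int nr_nodes)
{
	int spanned = 0;

	for (int i = 0; i < nr_nodes; i++)
		if (llc & node_masks[i])
			spanned++;
	return spanned;
}

/* True when the LLC is shared by CPUs from more than one NUMA node. */
static bool llc_spans_numa_nodes(cpumask64 llc, const cpumask64 *node_masks,
				 int nr_nodes)
{
	return nodes_spanned_by_llc(llc, node_masks, nr_nodes) > 1;
}
```

With two nodes covering CPUs 0-3 and 4-7, an LLC mask of CPUs 0-7 trips the
check while an LLC mask of CPUs 0-3 does not. The check alone cannot tell a
legitimate SNC configuration from broken firmware tables, which is why some
additional hint would be useful.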