Re: [RFC PATCH] futex: Dynamically allocate futex_queues depending on nr_node_ids
From: K Prateek Nayak
Date: Wed Feb 25 2026 - 03:53:03 EST
On 2/25/2026 1:09 PM, Sebastian Andrzej Siewior wrote:
> On 2026-02-25 09:06:08 [+0530], K Prateek Nayak wrote:
>> Hello Sebastian,
> Hi Prateek,
>
>> On 2/24/2026 4:43 PM, Sebastian Andrzej Siewior wrote:
>>> My initial question was whether 1024 for max-nodes is something that
>>> people really use. It was introduced as of
>>> https://lore.kernel.org/all/alpine.DEB.2.00.1003101537330.30724@xxxxxxxxxxxxxxxxxxxxxxxxx/
>>>
>>> but it looks odd. It might be just one or two machines which are left :)
>>
>> I have it on good faith that some EPYC users on distro kernels turn on
>> the "L3 as NUMA" option, which currently results in 32 NUMA nodes on our
>> largest configuration.
>>
>> Adding a little bit more margin for CXL nodes should make even
>> CONFIG_NODES_SHIFT=6 a pretty sane default for most real-world configs.
>> I don't think we can go beyond 10 or so CXL nodes considering the
>> number of PCIe lanes, unless there are more creative ways to attach
>> tiered memory that appears as a NUMA node.
>>
>> I'm not sure if Intel has a similarly crazy combination, but NODES_SHIFT=6
>> can accommodate (16 sockets * SNC-3) + up to 16 CXL nodes, so it should
>> be fine for most distro users too?
>
> Okay. According to Kconfig, this is the default for X86_64. The 10 gets
> set by MAXSMP. That option raises NR_CPUS_DEFAULT to 8192, which might
> be overkill. What would be a sane value for NR_CPUS_DEFAULT?
I would have thought a quarter of that would be plenty, but looking at
the footnote in [1] that mentions a "16 socket GNR system", and given
that GNR can feature up to 256 threads per socket, such a system could
theoretically approach that NR_CPUS_DEFAULT limit - I don't know if it
is practically possible.
[1] https://lore.kernel.org/lkml/aYPjOgiO_XsFWnWu@xxxxxxx/
Still, I doubt such a setup would practically cross more than 64 nodes.
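For reference, a quick back-of-the-envelope check of the counts above (the
socket, SNC, CXL, and threads-per-socket figures are the ones assumed in
this thread, not taken from any hardware spec):

```python
# Sanity check of the NUMA node and CPU counts discussed in this thread.
# All figures (16 sockets, SNC-3, 16 CXL nodes, 256 threads/socket) are
# the assumptions from the discussion, not from any datasheet.

NODES_SHIFT = 6
max_nodes = 1 << NODES_SHIFT           # 64 addressable node IDs

sockets = 16
snc_nodes_per_socket = 3               # SNC-3: three NUMA nodes per socket
cxl_nodes = 16

needed_nodes = sockets * snc_nodes_per_socket + cxl_nodes
print(needed_nodes, max_nodes)         # 64 64 -> just fits in NODES_SHIFT=6

threads_per_socket = 256               # upper end assumed for GNR here
cpus = sockets * threads_per_socket
print(cpus)                            # 4096, half of MAXSMP's 8192 default
```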
Why was this selected as the default for MAXSMP? It came from [2], but
I'm not really able to understand why, other than this line in Mike's
response:
"MAXSMP" represents what's really usable
so we just set it to the max of the range to test for scalability? That
seems a little impractical for real-world cases, but on the flip side,
if we don't set it, some bits might not get enough testing?
[2] https://lore.kernel.org/lkml/20080326014137.934171000@xxxxxxxxxxxxxxxxxxxxxxxxxx/
> I don't have anything that exceeds 3 digits but I also don't have anything
> with more than 4 nodes ;)
And mine tops out at 32 nodes ;-)
--
Thanks and Regards,
Prateek