Re: [RFC PATCH] futex: Dynamically allocate futex_queues depending on nr_node_ids
From: K Prateek Nayak
Date: Fri Feb 27 2026 - 03:47:59 EST
Hey Sebastian,
Sorry for the delay!
On 2/25/2026 2:52 PM, Sebastian Andrzej Siewior wrote:
> On 2026-02-25 14:21:33 [+0530], K Prateek Nayak wrote:
> Hi Prateek,
>
>> I would have thought a quarter of that would be plenty but looking at
>> the footnote in [1] that says "16 socket GNR system" and the fact that
>> GNR can feature up to 256 threads per socket - that could theoretically
>> put such systems at that NR_CPUS_DEFAULT limit - I don't know if it is
>> practically possible.
>>
>> [1] https://lore.kernel.org/lkml/aYPjOgiO_XsFWnWu@xxxxxxx/
>>
>> Still, I doubt such setup would practically cross more than 64 nodes.
>
> I am still trying to figure out if this is practical or some drunk guys
> saying "you know what would be fun?"
>
>> Why was this selected as the default for MAXSMP? It came from [2] but
>> I'm not really able to understand why other than this line in Mike's
>> response:
>>
>> "MAXSMP" represents what's really usable
>>
>> so we just set it to the max of the range to test for scalability? Seems
>> a little impractical for real-world cases, but on the flip side, if we
>> don't set it, some bits might not get enough testing?
>
> Sounds like it. What would be a sane default upper limit then? Something
> like 1024 CPUs? 2048? Or even more than that?
I feel the current default for NR_CPUS can be retained as is, just to
be on the safe side.
Turns out QEMU allows for a ridiculous amount of vCPUs per guest and I've
found enough evidence of extremely large guests running oversubscribed
that sometimes run distro kernels :-(
>
> I would try to use this and convince Debian to drop MAXSMP and then
> lower NODES_SHIFT to default 6. I would need a default for
> NR_CPUS_DEFAULT without having people complaining about missing CPUs.
> Maybe we could get a sane default setting in kernel without testing
> limits.
*Theoretically*, with SNC-3 and 16 sockets + CXL, we can get close to the
!MAXSMP limit for NODES_SHIFT (6), so perhaps we should drop the default
down a couple of notches from 10 to 8 - that should give us ample room
for a long time in my opinion.
Folks who are doing *insane* NUMA emulation can perhaps explain the use
case or resort to building a kernel with a non-default NODES_SHIFT.
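For reference, the node limits implied by the NODES_SHIFT values under
discussion (the kernel derives MAX_NUMNODES as 1 << NODES_SHIFT), as a
quick shell sketch:

```shell
# Maximum NUMA node count for each NODES_SHIFT value mentioned above
for shift in 6 8 10; do
  echo "NODES_SHIFT=$shift -> MAX_NUMNODES=$(( 1 << shift ))"
done
```

That is 64 nodes at the !MAXSMP default of 6, 256 at the proposed 8, and
1024 at the current MAXSMP value of 10.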
>
> Also probably will compile two kernels to see how much memory this saves
> in total since there should be other data structures depending on max
> CPUs/ NODEs.
To keep the configs as close as possible, I had to resort to selecting
CONFIG_CPUMASK_OFFSTACK for !MAXSMP. Following is the bloat-o-meter
output with the reduced NODES_SHIFT, on kernels built with a config very
close to the Ubuntu distro config:
o NODES_SHIFT=8 : Total: Before=33017117, After=32109495, chg -2.75%
o NODES_SHIFT=6 : Total: Before=33017117, After=31930101, chg -3.29%
o NODES_SHIFT=6; NR_CPUS=4k : Total: Before=33017117, After=31196664, chg -5.51%
o NODES_SHIFT=6; NR_CPUS=2k : Total: Before=33017117, After=30829862, chg -6.62%
The last couple of configs add ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP. If I
remove that dependency, I don't see any change in the bloat-o-meter
results, so I don't think it makes much of a difference.
Runtime memory consumption differences are within the noise range for
me - I couldn't see any meaningful difference (or even a trend across
multiple runs) between the extreme configs after boot. I haven't done
any longer-running testing to spot anything.
I'll let you decide what is a good trade off between space saving and
future headaches :-)
--
Thanks and Regards,
Prateek