Re: [RFC PATCH] futex: Dynamically allocate futex_queues depending on nr_node_ids

From: Sebastian Andrzej Siewior

Date: Tue Feb 24 2026 - 06:14:48 EST


On 2026-01-28 10:13:58 [+0000], K Prateek Nayak wrote:
> CONFIG_NODES_SHIFT (which influences MAX_NUMNODES) is often configured
> generously by distros while the actual number of possible NUMA nodes on
> most systems is often quite conservative.
>
> Instead of reserving MAX_NUMNODES worth of space for futex_queues,
> dynamically allocate it based on "nr_node_ids" at the time of
> futex_init().
>
> "nr_node_ids" at the time of futex_init() is cached as "nr_futex_queues"
> to compensate for the extra dereference necessary to access the elements
> of futex_queues which ends up in a different cacheline now.
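
For reference, a minimal userspace sketch of what the patch amounts to (the names nr_futex_queues / futex_queues mirror the patch description; the stand-in types and the nr_node_ids parameter are illustrative only, not the actual kernel code):

```c
#include <stdlib.h>

/* mirrors the patch's idea in plain C: cache nr_node_ids at init time
 * so later lookups use the cached copy (names are illustrative only) */
static unsigned int nr_futex_queues;
static void **futex_queues;	/* stands in for struct futex_hash_bucket *[] */

static int futex_queues_init(unsigned int nr_node_ids)
{
	/* size the array from the nodes actually present, not MAX_NUMNODES */
	nr_futex_queues = nr_node_ids;
	futex_queues = calloc(nr_futex_queues, sizeof(*futex_queues));
	return futex_queues ? 0 : -1;
}
```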

With the Debian config, CONFIG_NODES_SHIFT is set to 10 as of
6.18.12+deb14 for amd64, probably due to MAXSMP.

> Running 5 runs of perf bench futex showed no measurable impact for any
> variants on a dual socket 3rd generation AMD EPYC system (2 x 64C/128T):
>
> variant              locking/futex   base + patch   %diff
> futex/hash           1220783.2       1333296.2      (9%)
> futex/wake           0.71186         0.72584        (2%)
> futex/wake-parallel  0.00624         0.00664        (6%)
> futex/requeue        0.25088         0.26102        (4%)
> futex/lock-pi        57.6            57.8           (0%)
>
> Note: futex/hash had noticeable run to run variance on test machine.

so we are getting slightly worse?

> "nr_node_ids" can rarely be larger than num_possible_nodes() but the
> additional space allows for simpler handling of node index in presence
> of sparse node_possible_map.
>
> Reported-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> ---
> Sebastian,
>
> Does this work for your concerns with the large "MAX_NUMNODES" values on
> most distros? It does put the "queues" into a separate cacheline from
> the __futex_data.

I didn't want to do this because now we have two pointers to resolve;
nr_node_ids vs nr_futex_queues should be largely the same. And I *think*
the kernel image is mapped interleaved while the kcalloc() memory comes
from the current node (mostly node 1). Having the huge array does not
create any runtime overhead, it is just that we allocate 8KiB of memory
here while 32 bytes for the average 4 nodes would be just fine. At least
my assumption is that 4 nodes is the average upper limit.

My initial question was whether 1024 for max-nodes is something that
people really use. It was introduced in
https://lore.kernel.org/all/alpine.DEB.2.00.1003101537330.30724@xxxxxxxxxxxxxxxxxxxxxxxxx/

but it looks odd. It might be just one or two machines which are left :)

> The other option is to dynamically allocate the entire __futex_data as:
>
> struct {
> unsigned long hashmask;
> unsigned int hashshift;
> unsigned int nr_queues;
> struct futex_hash_bucket *queues[] __counted_by(nr_queues);
> } *__futex_data __ro_after_init;
>
> with a variable length "queues" at the end if we want to ensure
> everything ends up in the same cacheline but all the __futex_data
> member access would then be pointer dereferencing which might not be
> ideal.

Here we would also have two pointers to resolve and I don't think it is
worth it.
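For reference, a plain-C sketch of how that single-allocation layout would be set up (illustrative only, not the kernel code; the helper name is made up):

```c
#include <stdlib.h>

/* sketch of the quoted alternative: the metadata and the per-node
 * queue pointers live in one allocation behind one pointer */
struct futex_data {
	unsigned long hashmask;
	unsigned int hashshift;
	unsigned int nr_queues;
	void *queues[];		/* flexible array, nr_queues entries */
};

static struct futex_data *futex_data_alloc(unsigned int nr_queues)
{
	struct futex_data *d;

	d = calloc(1, sizeof(*d) + nr_queues * sizeof(d->queues[0]));
	if (d)
		d->nr_queues = nr_queues;
	return d;
}
```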

> Thoughts?

Having a statement that these machines are in the minority and not used
by a wider range of people might convince Debian to lower the default. I
haven't looked into other distros, but MAXSMP on x86 will probably force
the 10 there, too. Especially if *those* machines are used only by
Google/Amazon/Oracle, who run their own kernel and not the Debian one.
Maybe it would work to hide it behind MAXNUMA and keep the default for
x86 at 6.
Looking around, the range is 1…10 on arm64 and riscv as well. Looking
into the configs I see

| boot/config-6.18.12+deb14-arm64:CONFIG_NODES_SHIFT=4
| boot/config-6.18.12+deb14-arm64-16k:CONFIG_NODES_SHIFT=4
| boot/config-6.18.12+deb14-loong64:CONFIG_NODES_SHIFT=6
| boot/config-6.18.12+deb14-powerpc64le:CONFIG_NODES_SHIFT=8
| boot/config-6.18.12+deb14-powerpc64le-64k:CONFIG_NODES_SHIFT=8
| boot/config-6.18.12+deb14-riscv64:CONFIG_NODES_SHIFT=2

While most look sane, loong64 looks odd: an architecture this young
already defaulting to 64 nodes. Not sure how much of this is copy/paste
and how much is actual need.

Sebastian