Re: [RFC PATCH] futex: Dynamically allocate futex_queues depending on nr_node_ids
From: Peter Zijlstra
Date: Fri Feb 27 2026 - 09:42:21 EST
On Wed, Jan 28, 2026 at 10:13:58AM +0000, K Prateek Nayak wrote:
> CONFIG_NODES_SHIFT (which determines MAX_NUMNODES) is often configured
> generously by distros, while the actual number of possible NUMA nodes
> on most systems is quite small.
>
> Instead of reserving MAX_NUMNODES worth of space for futex_queues,
> dynamically allocate it based on "nr_node_ids" at the time of
> futex_init().
>
> "nr_node_ids" at the time of futex_init() is cached as "nr_futex_queues"
> to compensate for the extra dereference now needed to access the
> elements of futex_queues, which end up in a different cacheline.
>
> Running 5 runs of perf bench futex showed no measurable impact for any
> variants on a dual socket 3rd generation AMD EPYC system (2 x 64C/128T):
>
> variant               locking/futex   base + patch   %diff
> futex/hash                1220783.2      1333296.2    (9%)
> futex/wake                  0.71186        0.72584    (2%)
> futex/wake-parallel         0.00624        0.00664    (6%)
> futex/requeue               0.25088        0.26102    (4%)
> futex/lock-pi                  57.6           57.8    (0%)
>
> Note: futex/hash showed noticeable run-to-run variance on the test
> machine.
>
> "nr_node_ids" can occasionally be larger than num_possible_nodes(), but
> the extra space allows simpler handling of node indices in the presence
> of a sparse node_possible_map.
>
> Reported-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> ---
> Sebastian,
>
> Does this address your concerns about the large "MAX_NUMNODES" values
> on most distros? It does put the "queues" into a separate cacheline
> from the __futex_data.
>
> The other option is to dynamically allocate the entire __futex_data as:
>
> 	struct {
> 		unsigned long hashmask;
> 		unsigned int hashshift;
> 		unsigned int nr_queues;
> 		struct futex_hash_bucket *queues[] __counted_by(nr_queues);
> 	} *__futex_data __ro_after_init;
>
> with a variable-length "queues" array at the end, if we want to ensure
> everything ends up in the same cacheline; but every __futex_data member
> access would then go through a pointer dereference, which might not be
> ideal.
>
> Thoughts?
Both will result in at least one extra deref/cacheline for each futex
op, no?