Re: [RFC PATCH] futex: Dynamically allocate futex_queues depending on nr_node_ids

From: K Prateek Nayak

Date: Fri Feb 27 2026 - 10:07:37 EST


Hello Peter,

On 2/27/2026 8:12 PM, Peter Zijlstra wrote:
> On Wed, Jan 28, 2026 at 10:13:58AM +0000, K Prateek Nayak wrote:
>> CONFIG_NODES_SHIFT (which influences MAX_NUMNODES) is often configured
>> generously by distros while the actual number of possible NUMA nodes on
>> most systems is often quite conservative.
>>
>> Instead of reserving MAX_NUMNODES worth of space for futex_queues,
>> dynamically allocate it based on "nr_node_ids" at the time of
>> futex_init().
>>
>> "nr_node_ids" at the time of futex_init() is cached as "nr_futex_queues"
>> to compensate for the extra dereference necessary to access the elements
>> of futex_queues which ends up in a different cacheline now.
>>
>> Five runs of perf bench futex showed no measurable impact for any
>> variant on a dual-socket 3rd generation AMD EPYC system (2 x 64C/128T):
>>
>> variant               locking/futex  base + patch  %diff
>> futex/hash                1220783.2     1333296.2   (9%)
>> futex/wake                  0.71186       0.72584   (2%)
>> futex/wake-parallel         0.00624       0.00664   (6%)
>> futex/requeue               0.25088       0.26102   (4%)
>> futex/lock-pi                  57.6          57.8   (0%)
>>
>> Note: futex/hash had noticeable run-to-run variance on the test machine.
>>
>> "nr_node_ids" can occasionally be larger than num_possible_nodes(), but
>> the additional space allows for simpler handling of node indices in the
>> presence of a sparse node_possible_map.
>>
>> Reported-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
>> ---
>> Sebastian,
>>
>> Does this work for your concerns with the large "MAX_NUMNODES" values on
>> most distros? It does put the "queues" into a separate cacheline from
>> the __futex_data.
>>
>> The other option is to dynamically allocate the entire __futex_data as:
>>
>> struct {
>> 	unsigned long hashmask;
>> 	unsigned int hashshift;
>> 	unsigned int nr_queues;
>> 	struct futex_hash_bucket *queues[] __counted_by(nr_queues);
>> } *__futex_data __ro_after_init;
>>
>> with a variable-length "queues" at the end if we want to ensure
>> everything ends up in the same cacheline, but all __futex_data member
>> accesses would then go through a pointer dereference, which might not
>> be ideal.
>>
>> Thoughts?
>
> Both will result in at least one extra deref/cacheline for each futex
> op, no?

Ack, but I was wondering whether that penalty can be offset by the fact
that we no longer need to look up "nr_node_ids" in a separate cacheline?

I ran the futex benchmarks enough times before posting to conclude that
there isn't any noticeable regression - the numbers swung either way and
I just took one set for comparison.

Sebastian and I have been having a more philosophical discussion about
the CONFIG_NODES_SHIFT default, but I guess as far as this patch is
concerned, the conclusion is that we want to avoid an extra dereference
in the fast-path at the cost of a little extra space?

--
Thanks and Regards,
Prateek