Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL
From: Sebastian Andrzej Siewior
Date: Wed Mar 26 2025 - 04:51:29 EST
On 2025-03-18 18:54:22 [+0530], Shrikanth Hegde wrote:
> I tried this in one of our systems(Single NUMA, 80 CPUs), I see significant reduction in futex/hash.
> Maybe i am missing some config or doing something stupid w.r.t to benchmarking.
> I am trying to understand this stuff.
>
> I ran "perf bench futex all" as is. No change has been made to perf.
> =========================================
> Without patch: at 6575d1b4a6ef3336608127c704b612bc5e7b0fdc
> # Running futex/hash benchmark...
> Run summary [PID 45758]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
> Averaged 1556023 operations/sec (+- 0.08%), total secs = 10 <<--- 1.5M
>
> =========================================
> With the Series: I had to make PR_FUTEX_HASH=78 since 77 is used for TIMERs.
>
> # Running futex/hash benchmark...
> Run summary [PID 8644]: 80 threads, each operating on 1024 [private] futexes for 10 secs.
> Averaged 150382 operations/sec (+- 0.42%), total secs = 10 <<-- 0.15M, close to 10x down.
>
> =========================================
>
> Did try a git bisect based on the futex/hash numbers. It narrowed it to this one.
> first bad commit: [5dc017a816766be47ffabe97b7e5f75919756e5c] futex: Allow automatic allocation of process wide futex hash.
>
> Is this expected given the complexity of hash function change?
So with 80 CPUs/ threads you should end up with roundup_pow_of_two(80 *
4) = 512 buckets. Before the series you should have
roundup_pow_of_two(80 * 256) = 32768 buckets. This is also printed at
boot.
_Now_ you have less buckets so a hash collision is more likely to
happen. To get to the old numbers you would have increase the buckets
and you get the same results. I benchmark a few things at
https://lore.kernel.org/all/20241101110810.R3AnEqdu@xxxxxxxxxxxxx/
This looks like the series makes it worse. But then those buckets are
per-task so you won't collide with a different task. This in turn should
relax the situation as a whole because different tasks can't block each
other. If two threads block on the same bucket then they might use the
same `uaddr'.
The benchmark measures how many hash operations can be performed per
second. This means hash + lock + unlock. In reality you would also
queue, wait and wake. It is not very use-case driven.
The only thing that it measures is hash quality in terms of distribution
and the time spent to perform the hashing operation. If you want to
improve any of the two then this is the micro benchmark for it.
> Also, is there a benchmark that could be run to evaluate FUTEX2_NUMA, I would like to
> try it on multi-NUMA system to see the benefit.
Let me try to add that up to the test tool.
Sebastian