Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL

From: Shrikanth Hegde
Date: Wed Mar 26 2025 - 08:55:17 EST

Next message: Takeshi Ogasawara: "Re: [PATCH 6.13 000/119] 6.13.9-rc1 review"
Previous message: Chris Bainbridge: "[PATCH] drm/nouveau: prime: drm_prime_gem_destroy comment"
In reply to: Sebastian Andrzej Siewior: "Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL"
Next in thread: Sebastian Andrzej Siewior: "Re: [PATCH v10 00/21] futex: Add support task local hash maps, FUTEX2_NUMA and FUTEX2_MPOL"

On 3/26/25 15:01, Sebastian Andrzej Siewior wrote:

On 2025-03-26 00:34:23 [+0530], Shrikanth Hegde wrote:

Hi Sebastian.

Hi Shrikanth,

Hi.

So, I did some more benchmarking using the same perf futex hash.
I see that perf creates N threads, binds each thread to a CPU, and then
calls futex_wait such that it never blocks. It always returns EWOULDBLOCK,
so only futex_hash is exercised.

It also does spin_lock() + unlock on the hash bucket. Without the
locking, you would have constant numbers.

Thanks for the explanations.
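
For readers following along, the non-blocking wait described above can be reproduced in user space with a rough sketch like the one below. It is not the perf bench code; the helper names futex_wait_private() and nonblocking_wait_errno() are made up for illustration. The caller passes an expected value that deliberately differs from the futex word, so the kernel hashes the address, takes and drops the bucket lock, and returns immediately. On Linux, EWOULDBLOCK and EAGAIN are the same value.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper (hypothetical name); glibc does not export a futex() function. */
static long futex_wait_private(uint32_t *uaddr, uint32_t expected)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

/*
 * Issue one FUTEX_WAIT whose expected value does not match the futex word.
 * Returns the resulting errno, or 0 if the call unexpectedly succeeded.
 */
static int nonblocking_wait_errno(void)
{
	uint32_t word = 0;                        /* actual value of the futex word */
	long ret = futex_wait_private(&word, 1);  /* expect 1, so the wait is refused */

	return ret == -1 ? errno : 0;
}
```

Calling nonblocking_wait_errno() in a tight loop from N pinned threads would approximate what the benchmark measures: the hash lookup plus the bucket lock, with no actual sleeping.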

Plus, the way perf does it, all the SMT threads would be up, and the 1-thread case
probably gets the benefit of SMT folding. So beyond 40 threads, the baseline numbers don't change.

Numbers with different threads. (private futexes)
threads   baseline   with series   (ratio)
      1    3386265       3266560    0.96
     10    1972069        821565    0.41
     40    1580497        277900    0.17
     80    1555482        150450    0.096

With Shared Futex: (-s option)
threads   baseline   with series   (ratio)
     80     590144        585067    0.99

The shared numbers are equal since the code path there is unchanged.
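
As a quick cross-check of the ratio column, here is a small sketch that recomputes with-series / baseline from the private-futex numbers quoted above. The struct and function names are invented for this example; only the figures come from the mail. The listed ratios appear to be truncated rather than rounded.

```c
#include <stddef.h>
#include <stdio.h>

/* One row of the private-futex table above. */
struct sample {
	int threads;
	unsigned long baseline;
	unsigned long with_series;
};

static const struct sample private_futex[] = {
	{  1, 3386265, 3266560 },
	{ 10, 1972069,  821565 },
	{ 40, 1580497,  277900 },
	{ 80, 1555482,  150450 },
};

/* Ratio of the patched result to the baseline result. */
static double ratio(const struct sample *s)
{
	return (double)s->with_series / (double)s->baseline;
}

/* Print one line per thread count. */
static void print_ratios(void)
{
	size_t n = sizeof(private_futex) / sizeof(private_futex[0]);

	for (size_t i = 0; i < n; i++)
		printf("%3d threads: %.4f\n",
		       private_futex[i].threads, ratio(&private_futex[i]));
}
```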

After looking into the code, and after some hacking, I could get the
performance back with the below change. This is likely not functionally correct.
The reasons for the below change are:

1. perf report showed significant time in futex_private_hash_put,
so I removed the RCU usage for users. That brought some improvement,
from 150k to 300k. Is there a better way to do this users protection?

This is likely from the atomic dec operation itself. Then there is also
the preemption counter operation. The inc should also be visible but
might be inlined into the hash operation.
This is _just_ the atomic inc/dec that doubled the "throughput" but you
don't have anything from the regular path.
Anyway. To avoid the atomic part we would need to have a per-CPU counter
instead of a global one and a more expensive slow path for the resize
since you have to sum up all the per-CPU counters and so on. Not sure it
is worth it.
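
To make the trade-off concrete, here is a user-space sketch of a sharded counter, assuming a fixed number of shards standing in for CPUs. It is not the kernel's per-CPU machinery and not code from the series; all names (sharded_ref, ref_get, ref_put, ref_sum) and the constants are hypothetical. The fast path touches only one cacheline-sized shard, while the slow path has to walk all of them, which is the extra cost a resize would pay.

```c
#include <stdalign.h>
#include <stdatomic.h>

#define NR_SHARDS 8   /* hypothetical stand-in for the number of CPUs */
#define CACHELINE 64  /* assumed cacheline size */

/* One counter per shard, each aligned to its own cacheline. */
struct shard {
	alignas(CACHELINE) atomic_long count;
};

struct sharded_ref {
	struct shard shards[NR_SHARDS];
};

static void ref_init(struct sharded_ref *r)
{
	for (unsigned int i = 0; i < NR_SHARDS; i++)
		atomic_init(&r->shards[i].count, 0);
}

/* Fast path: only the caller's shard is written, so no shared line bounces. */
static void ref_get(struct sharded_ref *r, unsigned int shard)
{
	atomic_fetch_add_explicit(&r->shards[shard % NR_SHARDS].count, 1,
				  memory_order_relaxed);
}

/* A put may land on a different shard than the get; single shards can go negative. */
static void ref_put(struct sharded_ref *r, unsigned int shard)
{
	atomic_fetch_sub_explicit(&r->shards[shard % NR_SHARDS].count, 1,
				  memory_order_relaxed);
}

/* Slow path: sum every shard. Only exact while no get/put runs concurrently. */
static long ref_sum(struct sharded_ref *r)
{
	long sum = 0;

	for (unsigned int i = 0; i < NR_SHARDS; i++)
		sum += atomic_load_explicit(&r->shards[i].count,
					    memory_order_relaxed);
	return sum;
}
```

The sketch leaves out the hard part: a real implementation must stop new references from appearing while the slow path sums the shards, which is where the more expensive resize path would come from.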

A resize would happen when one does prctl, right? Or
can it happen automatically too?

fph is going to be on the thread leader's CPU, and using atomics on
fph->users would likely cause cacheline bouncing, no?

Not sure if this happens only due to this benchmark, which doesn't actually block.
Maybe in real-life use cases this doesn't matter.

2. Since the number of buckets would be lower by default, this would cause hb
collisions. This was seen as queued_spin_lock_slowpath. I increased the hash
bucket size to what it was before the series. That brought the numbers back to
1.5M. This could be achieved with prctl in perf/bench/futex-hash.c, I guess.
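
The collision effect in point 2 can be illustrated with a small sketch that hashes a number of distinct keys into a power-of-two bucket array and reports the fullest bucket. The mixer below is a generic integer hash chosen for the example, not the kernel's futex hash, and the key pattern and bucket counts are hypothetical. With 80 keys and 16 buckets, at least one bucket must hold 5 or more keys by pigeonhole, so threads hashing there would contend on the same bucket lock.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BUCKETS 4096  /* arbitrary upper bound for this sketch */

/* Generic 32-bit integer mixer; NOT the hash the kernel uses for futexes. */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x7feb352dU;
	x ^= x >> 15;
	x *= 0x846ca68bU;
	x ^= x >> 16;
	return x;
}

/*
 * Hash nkeys distinct keys into nbuckets and return the load of the
 * fullest bucket. Returns 0 if nbuckets is not a power of two in range.
 */
static unsigned int max_bucket_load(unsigned int nkeys, unsigned int nbuckets)
{
	static unsigned int load[MAX_BUCKETS];
	unsigned int max = 0;

	if (nbuckets == 0 || nbuckets > MAX_BUCKETS || (nbuckets & (nbuckets - 1)))
		return 0;

	for (unsigned int i = 0; i < nbuckets; i++)
		load[i] = 0;

	for (unsigned int k = 0; k < nkeys; k++) {
		/* Keys mimic 4-byte-aligned futex words at made-up addresses. */
		uint32_t bucket = mix32(0x1000u + 4u * k) & (nbuckets - 1);

		if (++load[bucket] > max)
			max = load[bucket];
	}
	return max;
}
```

Comparing max_bucket_load(80, 16) with max_bucket_load(80, 4096) gives a feel for why a larger table reduces the time spent in the lock slow path.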

Yes. The idea is to avoid a resize at runtime and setting to something
you know best. You can also use it now to disable the private hash and
stick with the global.

Yes. SET_SLOTS would take care of it.
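
A benchmark could request a bucket count up front along these lines. This is only a sketch under stated assumptions: the thread names SET_SLOTS and prctl but not the exact constants, so PR_FUTEX_HASH and PR_FUTEX_HASH_SET_SLOTS and their fallback values below are assumptions that must be checked against the <linux/prctl.h> of the kernel in use. The wrapper name futex_hash_set_slots() is made up. On a kernel without this prctl the call is expected to fail, typically with EINVAL.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Assumed fallback values; they are not stated in this thread.
 * Prefer the definitions from <linux/prctl.h> when they exist.
 */
#ifndef PR_FUTEX_HASH
#define PR_FUTEX_HASH 78
#endif
#ifndef PR_FUTEX_HASH_SET_SLOTS
#define PR_FUTEX_HASH_SET_SLOTS 1
#endif

/*
 * Ask the kernel for a private futex hash with the given number of slots.
 * Returns 0 on success or a negative errno on failure.
 */
static int futex_hash_set_slots(unsigned long slots)
{
	if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots, 0, 0) == -1)
		return -errno;
	return 0;
}
```

Calling this once before the worker threads are created would fix the size ahead of time, so no resize is needed at runtime.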

Note: Just increasing the hash bucket size without point 1 didn't matter much.

Sebastian
