Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
From: Peter Zijlstra
Date: Tue Jun 09 2026 - 06:47:27 EST
On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
> struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
> struct plist_head chain, struct futex_private_hash *priv) into a
> single ____cacheline_aligned_in_smp 64-byte block. Three distinct
> access patterns hit that line:
>
> 1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
> on the fast path before taking the lock.
> 2. spin_lock(&hb->lock) contenders writing the lock word.
> 3. The lock holder modifying chain.{next,prev} on every futex_wake,
> futex_q_unlock, plist_add, __futex_unqueue.
>
> This was first noticed on a Meta cache (ucache) production workload:
> perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as
> the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct
> CPUs in a second.
>
> The contention is not specific to that workload, though. Our very own
> "perf bench futex" hash exercises the same buckets and shows the same
> false sharing, so the rest of this changelog quantifies the fix with
> perf bench futex.
So I can't see this. After 'fixing' the benchmark to run with a fixed
number of buckets (see below), a perf c2c record shows the
futex_hash_bucket::priv load to be the 'expensive' (when doing perf
report on that, rather than perf c2c report, because this latter is
total garbage)
> Move chain to its own cacheline so:
> - Lockless waiters_pending() readers no longer invalidate the line
> that lock contenders are spinning to acquire.
> - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
> next holder reads chain from its own L2/L3 instead of fetching
> chain entries together with the lock byte.
>
> This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
> 15%:
>
> baseline +fix delta
> average 1,394,938 1,616,781 +15.9 %
> median 1,430,012 1,617,072 +13.1 %
> min 1,214,488 1,501,741 +23.5 %
> max 1,488,167 1,730,734 +16.3 %
>
> The distributions do not overlap: the slowest +fix run (1.50 M) is
> faster than every baseline run except the single fastest (1.49 M).
When I run: "perf bench futex hash", I do see massive contention, but
not on the line you mention. Instead we hammer mm->futex.phash.atomic in
futex_ref_{get,put}().
These are the atomic_long_inc_not_zero() / atomic_long_dec_and_test().
The reason this happens is unfortunate, you would want this thing to hit
the PERCPU fast-path, but due to the per-thread auto scaling, the
benchmark startup phase allocates a (2 thread) small hash, then a bigger
and a bigger, for each next thread that comes in.
Per there being a pending new hash, we drop to ATOMIC mode, such that we
can actually observe the 0 references.
However, because the benchmark is in fact hammering the buckets (per
design), it will never actually hit 0 references and swap in the larger
hash.
If one were to specific an explicit number of buckets, the benchmark
will function correctly:
v7.1-rc7 +patch
perf bench futex hash 192479 195523 +1.5%
perf bench futex hash -b 256 3453734 3987880 +15.5%
And then I do see the improvement from your patch, but I really cannot
make sense of your reasoning for it.
> Cost: one extra cacheline (56 B padding) per bucket. Would it be
> acceptable?
I'm really not sure, it *doubles* the futex memory cost.