Re: [PATCH] futex: fix NUMA node publication race causing missed wakeups
From: Peter Zijlstra
Date: Thu Mar 12 2026 - 05:55:45 EST

On Thu, Mar 12, 2026 at 10:37:09AM +0100, Sebastian Andrzej Siewior wrote:
> On 2026-03-03 03:01:00 [+0000], Chengfeng Ye wrote:
> > get_futex_key() publishes the FUTEX2_NUMA node side word in userspace.
> > The publication path used a non-atomic read/compute/write sequence, so
> > concurrent callers could overwrite each other during initialization.
> >
> > This race can make concurrent operations on the same futex derive
> > different node values while the NUMA hint is being initialized,
> > resulting in inconsistent futex keying between wait and wake sides.
> > In practice this can lead to missed wakeups; at user level, missed
> > wakeups can manifest as threads waiting indefinitely
> > (application-level deadlock/hang).
> >
> > PoC description (see Link below):
> > - two threads repeatedly exercising FUTEX2_NUMA wait/wake on the
> > same futex,
> > - waiter and waker pinned to CPUs from different NUMA nodes,
> > - waker continuously issuing wake calls while waiter performs
> > 10-second timed waits.
> >
> > PoC output on unpatched kernel (wake signal missed and waiter timed out):
> > - observed on Linux v7.0-rc2 running in qemu-system-x86_64 with
> > 4 vCPUs
> > Using CPU 0 (waiter) and CPU 2 (waker) from different NUMA nodes
> > [TRIGGER EVENT #1] iter=38 timed out (futex.node=1)
> > [TRIGGER EVENT #2] iter=85 timed out (futex.node=1)
> > [TRIGGER EVENT #3] iter=95 timed out (futex.node=1)
> >
> > Fix by making node-hint publication publish-once via atomic cmpxchg on
> > naddr (FUTEX_NO_NODE -> computed node), retrying transient -EAGAIN,
> > and adopting/validating the winner value on contention.
> >
> > Fixes: c042c505210d ("futex: Implement FUTEX2_MPOL")
> > Link: https://gist.github.com/Ychame/d4a5e95401a471f4211a751734b5d164
> > Signed-off-by: Chengfeng Ye <dg573847474@xxxxxxxxx>
>
> I did point out this scenario and it was said that this should not be
> done this way. Initialize once and be done with it plus with mpol the
> value should be consistent.
Right, see tools/testing/selftests/futex/functional/futex_numa.c, which
has a very simple NUMA lock implementation you can crib from.

You can only clear the node word when you clear the waiter bit (i.e.,
there are no more waiters left), and it must be done atomically such that
any concurrent lock operation will DTRT.

Specifically, futex_numa_32 requires a 64-bit cmpxchg.
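
To illustrate the rule above, here is a rough, untested-in-anger sketch of
an unlock path that resets the node word only in the same 64-bit cmpxchg
that sees no waiter bit. This is not the selftest's code; the struct
layout, the bit values, and every sketch_* / SKETCH_* name are made up for
illustration, and SKETCH_NO_NODE merely stands in for FUTEX_NO_NODE. It
assumes the GCC/Clang __atomic builtins.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; SKETCH_NO_NODE stands in for FUTEX_NO_NODE. */
#define SKETCH_NO_NODE	((uint32_t)-1)
#define SKETCH_LOCKED	0x1u
#define SKETCH_WAITERS	0x2u

/* Value word and node word side by side, so one 64-bit cmpxchg covers both. */
struct sketch_numa_lock {
	union {
		uint64_t full;
		struct {
			uint32_t val;
			uint32_t node;
		};
	};
};

/*
 * Drop the lock bit. The node word is reset only in the same cmpxchg
 * that observes no waiter bit; with waiters present it is left alone.
 * Returns true when the caller still has to issue a futex wake.
 */
static bool sketch_unlock(struct sketch_numa_lock *lock)
{
	struct sketch_numa_lock old, new;

	old.full = __atomic_load_n(&lock->full, __ATOMIC_RELAXED);
	for (;;) {
		new = old;
		new.val &= ~SKETCH_LOCKED;
		if (!(new.val & SKETCH_WAITERS))
			new.node = SKETCH_NO_NODE;

		/* On failure old.full is refreshed with the current value. */
		if (__atomic_compare_exchange_n(&lock->full, &old.full, new.full,
						false, __ATOMIC_RELEASE,
						__ATOMIC_RELAXED))
			return new.val & SKETCH_WAITERS;
	}
}
```

Because val and node are swapped in one go, a concurrent locker either
sees the old (val, node) pair or the new one, never a cleared node next to
a still-set waiter bit.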