Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?

From: Mathieu Desnoyers

Date: Fri Feb 20 2026 - 16:42:21 EST


+CC libc-alpha.

On 2026-02-20 15:26, André Almeida wrote:
> During LPC 2025, I presented a session about creating a new syscall for
> robust_list [0][1]. However, most of the session discussion was not about
> the new syscall itself, but rather about an old bug that exists in the
> current robust_list mechanism.
>
> Since at least 2012, there has been an open bug reporting a race
> condition, as Carlos O'Donell pointed out:
>
> "File corruption race condition in robust mutex unlocking"
> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>
> To help understand the bug, I've created a reproducer (patch 1/2) and a
> companion kernel hack (patch 2/2) that makes the race condition more
> likely to trigger. When the bug happens, the reproducer prints a message
> comparing the original memory with the corrupted one:
>
> "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs 8001fe8dc0000000"
>
> I'm not sure yet what the appropriate approach to fix it would be, so I
> decided to reach out to the community before moving forward in some
> direction. One suggestion from Peter [2] revolves around serializing
> mmap() and the robust list exit path, which might add overhead to the
> common case, where list_op_pending is empty.
>
> However, given that there's a new interface being prepared, this could
> also be an opportunity to rethink how list_op_pending works and to get
> rid of the race condition by design.
>
> Feedback is very much welcome.

Looking at this bug, one thing I'm starting to consider is that it
appears to be an issue inherent to the lack of synchronization between
pthread_mutex_destroy(3) and the per-thread list_op_pending fields,
and not so much a kernel issue.

Here is why I think the issue is purely userspace:

Let's suppose we have a memory area shared between Process 1 and Process 2,
managed internally by a custom userspace memory allocator that
allocates/frees space within that shared memory.

Process 1, Thread A stumbles through the scenario highlighted by this bug, and
basically gets preempted at this FIXME in libc __pthread_mutex_unlock_full():

  if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
                         & FUTEX_WAITERS) != 0))
    futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);

  /* We must clear op_pending after we release the mutex.
     FIXME However, this violates the mutex destruction requirements
     because another thread could acquire the mutex, destroy it, and
     reuse the memory for something else; then, if this thread crashes,
     and the memory happens to have a value equal to the TID, the kernel
     will believe it is still related to the mutex (which has been
     destroyed already) and will modify some other random object.  */
  __asm ("" ::: "memory");
  THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);

Then Process 1, Thread B runs, grabs the lock, releases it, and based on
program state it knows it can pthread_mutex_destroy() this lock, free its
associated memory through the custom shared memory allocator, and allocate
it for other purposes. Then we get to the point where Process 1 is
killed, and where the robust futex kernel code corrupts data in shared
memory because of the dangling list_op_pending pointer.

That shared memory data is still observable by Process 2, which will then
see the corrupted state.

Notice how this all happens without any munmap(2)/mmap(2) in the sequence?
This is why I think this is purely a userspace issue rather than one we
can solve by adding extra synchronization in the kernel.

The one point in that sequence where I think we can add synchronization
is pthread_mutex_destroy(3) in libc. One possible "big hammer" solution
would be to make pthread_mutex_destroy iterate over all other threads'
list_op_pending fields and busy-wait if it finds that the mutex address
is in use. It would of course only have to do that for robust mutexes.

If that big hammer solution is not fast enough for many-threaded
use-cases, then we can think of other approaches, such as adding a
reference counter in the mutex structure, or introducing hazard pointers
in userspace to reduce the synchronization iteration from nr_threads to
nr_cpus (or even down to the max rseq mm_cid).

Thoughts?

Thanks,

Mathieu


> Thanks!
> André
>
> [0] https://lore.kernel.org/lkml/20251122-tonyk-robust_futex-v6-0-05fea005a0fd@xxxxxxxxxx/
> [1] https://lpc.events/event/19/contributions/2108/
> [2] https://lore.kernel.org/lkml/20241219171344.GA26279@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com