Re: [RFC PATCH 0/2] futex: how to solve the robust_list race condition?

From: André Almeida

Date: Fri Feb 27 2026 - 14:18:43 EST


Hi Mathieu,

On 20/02/2026 20:17, Mathieu Desnoyers wrote:
On 2026-02-20 17:41, Mathieu Desnoyers wrote:
On 2026-02-20 16:42, Mathieu Desnoyers wrote:
+CC libc-alpha.

On 2026-02-20 15:26, André Almeida wrote:
During LPC 2025, I presented a session about creating a new syscall for
robust_list[0][1]. However, most of the discussion in that session wasn't
about the new syscall itself, but rather about an old bug that exists in
the current robust_list mechanism.

Since at least 2012, there has been an open bug reporting a race condition,
as Carlos O'Donell pointed out:

   "File corruption race condition in robust mutex unlocking"
   https://sourceware.org/bugzilla/show_bug.cgi?id=14485

To help understand the bug, I've created a reproducer (patch 1/2) and a
companion kernel hack (patch 2/2) that helps to make the race condition
more likely. When the bug happens, the reproducer shows a message
comparing the original memory with the corrupted one:

   "Memory was corrupted by the kernel: 8001fe8d8001fe8d vs 8001fe8dc0000000"

I'm not sure yet what the appropriate approach to fix it would be, so I
decided to reach out to the community before moving forward in some direction.
One suggestion from Peter[2] revolves around serializing mmap() and the
robust list exit path, which might add overhead for the common case,
where list_op_pending is empty.

However, given that there's a new interface being prepared, this could
also be an opportunity to rethink how list_op_pending works, and get
rid of the race condition by design.

Feedback is very much welcome.

Looking at this bug, one thing I'm starting to consider is that it
appears to be an issue inherent to the lack of synchronization between
pthread_mutex_destroy(3) and the per-thread list_op_pending fields,
and not so much a kernel issue.

Here is why I think the issue is purely userspace:

Let's suppose we have a memory area shared between Process 1 and Process 2,
which internally has its own custom userspace memory allocator to
allocate/free space within that shared memory.

Process 1, Thread A stumbles through the scenario highlighted by this bug, and
basically gets preempted at this FIXME in libc __pthread_mutex_unlock_full():

       if (__glibc_unlikely ((atomic_exchange_release (&mutex->__data.__lock, 0)
                              & FUTEX_WAITERS) != 0))
         futex_wake ((unsigned int *) &mutex->__data.__lock, 1, private);

       /* We must clear op_pending after we release the mutex.
          FIXME However, this violates the mutex destruction requirements
          because another thread could acquire the mutex, destroy it, and
          reuse the memory for something else; then, if this thread crashes,
          and the memory happens to have a value equal to the TID, the kernel
          will believe it is still related to the mutex (which has been
          destroyed already) and will modify some other random object.  */
       __asm ("" ::: "memory");
       THREAD_SETMEM (THREAD_SELF, robust_head.list_op_pending, NULL);

Then Process 1, Thread B runs, grabs the lock, releases it, and based on
program state it knows it can pthread_mutex_destroy() this lock, free its
associated memory through the custom shared memory allocator, and allocate
it for other purposes. Then we get to the point where Process 1 is
killed, and where the robust futex kernel code corrupts data in shared
memory because of the dangling list_op_pending pointer.

That shared memory data is still observable by Process 2, which will see a
corrupted state.
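The interleaving above can be written out as a timeline (a sketch; the
allocator calls are hypothetical names for the custom shared-memory
allocator, and the unlock steps are abbreviated from glibc):

```
/* Process 1, Thread A                   Process 1, Thread B
 *
 * pthread_mutex_unlock(m):
 *   list_op_pending = &m->__data.__list;
 *   atomic_exchange_release(&m->lock, 0);
 *   <preempted before clearing
 *    list_op_pending>
 *                                       pthread_mutex_lock(m);
 *                                       pthread_mutex_unlock(m);
 *                                       pthread_mutex_destroy(m);
 *                                       shm_free(m);        // custom allocator
 *                                       p = shm_alloc(...); // same address reused
 *                                       *p = <new data>;
 * <Process 1 is killed>
 *
 * Kernel, exit_robust_list():
 *   reads Thread A's list_op_pending, which still points at the old
 *   mutex address; if the word there happens to contain Thread A's TID,
 *   the kernel writes FUTEX_OWNER_DIED into it, corrupting *p.
 */
```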

Notice how this all happens without any munmap(2)/mmap(2) in the sequence?
This is why I think this is purely a userspace issue rather than an issue
we can solve by adding extra synchronization in the kernel.

The one point in that sequence where I think we can add synchronization
is pthread_mutex_destroy(3) in libc. One possible "big hammer" solution would be
to make pthread_mutex_destroy iterate over all other threads' list_op_pending
fields and busy-wait if it finds that the mutex address is in use. It would of
course only have to do that for robust futexes.
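A minimal userspace sketch of that big-hammer idea, under the assumption
that destroy can see every thread's pending slot (all names here are
hypothetical; glibc would walk its internal thread list rather than the
explicit registry shown):

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 128

/* Hypothetical registry of every thread's list_op_pending slot.
   In glibc this information already exists per thread in
   robust_head.list_op_pending; this array just stands in for it. */
static _Atomic(void *) op_pending[MAX_THREADS];
static _Atomic int nthreads;

/* Each thread registers its slot once; returns its index. */
static int register_thread(void)
{
    return atomic_fetch_add(&nthreads, 1);
}

/* Called around the unlock path: publish/clear the pending operation. */
static void set_op_pending(int tid, void *mutex_addr)
{
    atomic_store(&op_pending[tid], mutex_addr);
}

/* Big-hammer destroy: busy-wait until no thread still has this mutex
   address published as its pending robust-list operation. */
static int robust_mutex_destroy(pthread_mutex_t *m)
{
    int n = atomic_load(&nthreads);
    for (int i = 0; i < n; i++)
        while (atomic_load(&op_pending[i]) == (void *)m)
            sched_yield();      /* spin until the unlocker clears it */
    return pthread_mutex_destroy(m);
}
```

With this scheme the window described in the FIXME closes, because destroy
cannot complete while any unlocker is still between releasing the lock and
clearing its list_op_pending.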

If that big hammer solution is not fast enough for many-threaded use cases,
then we can think of other approaches such as adding a reference counter
in the mutex structure, or introducing hazard pointers in userspace to reduce
synchronization iteration from nr_threads to nr_cpus (or even down to max
rseq mm_cid).
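The reference-counter variant could be sketched like this (again with
made-up names; a real version would have to fit the counter into
pthread_mutex_t or a new ABI, and the counting would live inside the
lock/unlock paths themselves):

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical wrapper: a mutex plus a count of threads currently
   inside the lock/unlock paths, i.e. threads that may still have this
   address published in their list_op_pending. */
struct refcounted_mutex {
    pthread_mutex_t mutex;
    _Atomic int inflight;
};

static int rc_lock(struct refcounted_mutex *rm)
{
    atomic_fetch_add(&rm->inflight, 1);   /* pin before touching the lock */
    return pthread_mutex_lock(&rm->mutex);
}

static int rc_unlock(struct refcounted_mutex *rm)
{
    int ret = pthread_mutex_unlock(&rm->mutex);
    /* ...the real unlock path would clear list_op_pending here... */
    atomic_fetch_sub(&rm->inflight, 1);   /* unpin only after that */
    return ret;
}

static int rc_destroy(struct refcounted_mutex *rm)
{
    /* Destroy may only proceed once no thread can still hold a
       dangling list_op_pending pointing at this mutex. */
    while (atomic_load(&rm->inflight) != 0)
        sched_yield();
    return pthread_mutex_destroy(&rm->mutex);
}
```

Because the counter lives in the (shared) mutex structure itself, it is
observable from every process mapping the lock, unlike a per-process scan.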

To make matters even worse, the pthread_mutex_destroy(3) and reallocation
could happen from Process 2 rather than Process 1. So iterating over the
threads of Process 1 is not sufficient. We'd need to synchronize
pthread_mutex_destroy on something within the mutex structure which is
observable from all processes using the lock, for instance a reference count.
Trying to find a backward compatible way to solve this may be tricky.
Here is one possible approach I have in mind: Introduce a new syscall,
e.g. sys_cleanup_robust_list(void *addr)

This system call would be invoked on pthread_mutex_destroy(3) of
robust mutexes, and do the following:

- Calculate the offset of @addr within its mapping,
- Iterate on all processes which map the backing store which contain
  the lock address @addr.
  - Iterate on each thread sibling within each of those processes,
    - If the thread has a robust list, and its list_op_pending points
      to the same offset within the backing store mapping, clear the
      list_op_pending pointer.

The overhead would be added specifically to pthread_mutex_destroy(3),
and only for robust mutexes.
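In kernel terms, the steps above amount to something like the following
pseudocode (the helpers named here are illustrative, not existing kernel
APIs; only the proposed syscall name comes from the text above):

```
sys_cleanup_robust_list(addr):
    mapping = backing store of the page containing addr
    offset  = addr's offset within that mapping
    for each process P that maps `mapping`:
        for each thread T of P:
            if T has a registered robust list:
                pending = T's list_op_pending
                if pending resolves to the same (mapping, offset):
                    clear T's list_op_pending
```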

Thoughts?


Right, your explanation makes sense to me. I think the only difference between alloc/free and map/munmap is that "freeing memory does not actually return it to the operating system for other applications to use"[1], so I don't know if this custom allocator is violating some memory rules.

About the system call: we would call sys_cleanup_robust_list() before freeing/unmapping the robust mutex. To guarantee that we check every process that shares the memory region, would we need to check *every* single process? I don't think there's a way to find such mappings without checking them all.

I'm trying to explore the idea of the reference counter. Would munmap() be blocked until the refcount goes to zero, or something like that? I've also tried to find more examples of a memory region that's shared between one or more processes and the kernel at the same time to get some inspiration, but robust_list seems to be quite a unique design regarding this memory-sharing problem.

[1] https://sourceware.org/glibc/wiki/MallocInternals