Re: [patch V5 11/16] futex: Provide infrastructure to plug the non contended robust futex unlock race

From: Peter Zijlstra

Date: Wed Jun 03 2026 - 05:16:50 EST

On Tue, Jun 02, 2026 at 11:10:04AM +0200, Thomas Gleixner wrote:

> On X86 this boils down to this simplified assembly sequence:
>
> mov %esi,%eax // Load TID into EAX
> xor %ecx,%ecx // Set ECX to 0
> #3 lock cmpxchg %ecx,(%rdi) // Try the TID -> 0 transition
> .Lstart:
> jnz .Lend
> #4 movq %rcx,(%rdx) // Clear list_op_pending
> .Lend:
>
> If the cmpxchg() succeeds and the task is interrupted before it can clear
> list_op_pending in the robust list head (#4) and the task crashes in a
> signal handler or gets killed then it ends up in do_exit() and subsequently
> in the robust list handling, which then might run into the unmap/map issue
> described above.
>
> This is only relevant when user space was interrupted and a signal is
> pending. The fix-up has to be done before signal delivery is attempted
> because:
>
> 1) The signal might be fatal so get_signal() ends up in do_exit()
>
> 2) The signal handler might crash or the task is killed before returning
> from the handler. At that point the instruction pointer in pt_regs is
> no longer the instruction pointer of the initially interrupted unlock
> sequence.

However, due to the pending field being strictly per thread (thread
local storage and all that), the whole construct of futex robust unlock
is not signal safe in the sense that signal handlers must not use it.

A signal handler trying to use this would result in nested use of the
pending field, and that leads to corrupted state.
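
To make the nesting problem concrete, here is a minimal user space sketch.
It is not taken from the patch: struct robust_head_sketch, handler_sketch()
and outer_unlock_sketch() are made-up names, and the "signal" is simply a
function call placed where an interrupt could land. It only shows that a
handler which runs its own unlock overwrites and then clears the single
per-thread slot underneath the interrupted operation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread robust list head. Only the
 * pending slot matters for this illustration. */
struct robust_head_sketch {
	void *list_op_pending;
};

static struct robust_head_sketch head;	/* one per thread in reality */
static int lock_a, lock_b;		/* two unrelated futexes */

/* Plays the role of a signal handler which unlocks lock_b. */
static void handler_sketch(void)
{
	head.list_op_pending = &lock_b;	/* overwrites the outer entry */
	/* ... unlock of lock_b would happen here ... */
	head.list_op_pending = NULL;	/* cleared on completion */
}

/*
 * Outer unlock of lock_a. Returns 1 when the pending slot still names
 * lock_a at the point where the TID -> 0 transition would be done,
 * 0 when the slot was clobbered in between.
 */
static int outer_unlock_sketch(int with_signal)
{
	int intact;

	head.list_op_pending = &lock_a;	/* announce the operation */
	if (with_signal)
		handler_sketch();	/* "signal" lands mid-sequence */

	intact = (head.list_op_pending == &lock_a);
	assert(intact || with_signal);	/* only nesting can clobber it */

	head.list_op_pending = NULL;	/* regular completion */
	return intact;
}
```

Without the nested call the slot is intact; with it, the outer operation
proceeds with a slot that no longer describes the lock it is working on.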

> The right place to handle this is in __exit_to_user_mode_loop() before
> invoking arch_do_signal_or_restart() as this obviously covers both
> scenarios.
>
> As this is only relevant when the task was interrupted in user space, this
> is tied to RSEQ and the generic entry code as RSEQ keeps track of user
> space interrupts unconditionally even if the task does not have a RSEQ
> region installed. That makes the decision very lightweight:
>
> if (current->rseq.user_irq && within(regs, csr->unlock_ip_range))
> futex_fixup_robust_unlock(regs, csr);
>
> futex_fixup_robust_unlock() then invokes an architecture specific function
> to return the pending op pointer or NULL. The function evaluates the
> register content to decide whether the pending ops pointer in the robust
> list head needs to be cleared.
>
> Assuming the above unlock sequence, then on x86 this decision is the
> trivial evaluation of the zero flag:
>
> return regs->eflags & X86_EFLAGS_ZF ? regs->dx : NULL;
>
> Other architectures might need to do more complex evaluations due to LLSC,
> but the approach is valid in general. The size of the pointer is determined
> from the matching range struct, which covers both 32-bit and 64-bit builds
> including COMPAT.

So my initial thoughts today were that we should probably also move the
IP to .Lend, to prevent userspace from writing to that location again.

However, due to the above mentioned restrictions vs signals, there
cannot be a situation where this matters, and so the point is moot.

A double store is harmless and it makes the kernel just this little bit
simpler.
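
For the record, a small user space model of why the second store does not
matter. None of this is the actual kernel code: struct regs_sketch, struct
range_sketch and the *_sketch() helpers are invented for illustration, the
range is assumed to be [start, end), and pointer size handling is ignored.
The "kernel" clears the slot when ZF is set and leaves the IP alone; the
"user" side then resumes at .Lstart and stores zero again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZF_SKETCH	0x40UL	/* x86 zero flag, bit 6 */

struct regs_sketch {		/* hypothetical subset of pt_regs */
	uintptr_t ip;
	uintptr_t flags;
	uintptr_t dx;		/* address of list_op_pending */
};

struct range_sketch {		/* assumed half-open: [start, end) */
	uintptr_t start;	/* .Lstart */
	uintptr_t end;		/* .Lend */
};

static int within_sketch(const struct regs_sketch *regs,
			 const struct range_sketch *r)
{
	return regs->ip >= r->start && regs->ip < r->end;
}

/* Kernel side: clear the slot if cmpxchg succeeded. IP is not moved. */
static void fixup_sketch(const struct regs_sketch *regs,
			 const struct range_sketch *r)
{
	if (within_sketch(regs, r) && (regs->flags & ZF_SKETCH))
		*(void **)regs->dx = NULL;
}

/* User side after return: models "jnz .Lend; movq %rcx,(%rdx)". */
static void resume_sketch(const struct regs_sketch *regs)
{
	if (regs->flags & ZF_SKETCH)
		*(void **)regs->dx = NULL;	/* possibly the second store */
}

/* Returns the slot content after an unlock interrupted at .Lstart. */
static void *interrupted_unlock_sketch(int do_fixup)
{
	static int lock;
	void *pending = &lock;
	struct range_sketch r = { .start = 0x1000, .end = 0x1008 };
	struct regs_sketch regs = {
		.ip	= 0x1000,	/* interrupted right after cmpxchg */
		.flags	= ZF_SKETCH,	/* cmpxchg succeeded */
		.dx	= (uintptr_t)&pending,
	};

	assert(pending != NULL);	/* slot is armed before the fixup */
	if (do_fixup)
		fixup_sketch(&regs, &r);
	resume_sketch(&regs);
	return pending;
}
```

With or without the fixup the slot ends up NULL, so leaving the IP where it
is costs one redundant store and nothing else.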

The only reason I'm sending this email is to have this more explicitly
documented for posterity I suppose ;-)

> The unlock sequence is going to be placed in the VDSO so that the kernel
> can keep everything synchronized, especially the register usage. The
> resulting code sequence for user space is:
>
> if (__vdso_futex_robust_list$SZ_try_unlock(lock, tid, &pending_op) != tid)
> err = sys_futex($OP | FUTEX_ROBUST_UNLOCK,....);
>
> Both the VDSO unlock and the kernel side unlock ensure that the pending_op
> pointer is always cleared when the lock becomes unlocked.
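
A rough user space model of that calling sequence, for illustration only.
try_unlock_sketch() and slow_unlock_sketch() are hypothetical stand-ins for
the VDSO helper and the FUTEX_ROBUST_UNLOCK syscall path; their signatures
and return conventions are assumptions made for this sketch, not what the
series defines. WAITERS_SKETCH mirrors the value of FUTEX_WAITERS.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define WAITERS_SKETCH	0x80000000u	/* same value as FUTEX_WAITERS */

/*
 * Stand-in for the VDSO helper: try the TID -> 0 transition and clear
 * the pending slot on success. Returns the value observed in the lock
 * word, which equals tid only when the transition succeeded.
 */
static uint32_t try_unlock_sketch(_Atomic uint32_t *lock, uint32_t tid,
				  void **pending_op)
{
	uint32_t expected = tid;

	if (atomic_compare_exchange_strong(lock, &expected, 0))
		*pending_op = NULL;
	return expected;
}

/*
 * Stand-in for the syscall fallback: release the lock and clear the
 * pending slot. A real implementation would also wake a waiter.
 */
static int slow_unlock_sketch(_Atomic uint32_t *lock, void **pending_op)
{
	atomic_store(lock, 0);
	*pending_op = NULL;
	return 0;
}

/* The calling sequence from the changelog, expressed with the stand-ins. */
static int robust_unlock_sketch(_Atomic uint32_t *lock, uint32_t tid,
				void **pending_op)
{
	if (try_unlock_sketch(lock, tid, pending_op) != tid)
		return slow_unlock_sketch(lock, pending_op);
	return 0;
}
```

In both the uncontended case (lock word == tid) and the contended case
(lock word == tid | WAITERS_SKETCH) the model leaves the lock word at zero
and the pending slot cleared, which is the invariant stated above.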