Re: Robust futexes: lost wakeups and design flaws in the glibc/kernel synchronization scheme
From: Torvald Riegel
Date: Thu Jan 12 2017 - 04:39:53 EST
On Sat, 2016-12-24 at 17:01 +0100, Torvald Riegel wrote:
> TLDR:
> (1) The Linux kernel fails to wake futex waiters when a userspace thread
> that releases a mutex is killed between modifying the futex word and the
> subsequent FUTEX_WAKE syscall.
> (2) POSIX makes certain requirements regarding when destruction of a
> mutex is safe that are violated by the current glibc/kernel
> synchronization scheme, which can lead to corruption of userspace
> memory; robust FUTEX_UNLOCK_PI must be changed slightly, and the non-PI
> robust mutexes must either use a FUTEX_UNLOCK-like syscall to release
> the mutex, try to detect this by interpreting (current position in)
> userspace code, or we must add a new futex recovery syscall if we want
> to preserve the userspace-only mutex-release fastpath.
>
> [Please keep me in CC because I'm not subscribed to LKML.]
>
>
> === Robust mutexes have bugs, in both glibc and the kernel
>
> I've been reviewing the implementation of robust mutexes in both glibc
> and the kernel support code recently because there are several bug
> reports, for example:
> https://bugzilla.redhat.com/show_bug.cgi?id=1401665
> https://sourceware.org/bugzilla/show_bug.cgi?id=19402
> https://sourceware.org/bugzilla/show_bug.cgi?id=14485
>
> This review revealed a bunch of bugs. I have committed/proposed patches
> that fix all glibc-only bugs that I am aware of:
> https://sourceware.org/ml/libc-alpha/2016-12/msg00587.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00862.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00863.html
> https://sourceware.org/ml/libc-alpha/2016-12/msg00947.html
> These fixes are meant to be easy to backport, so they don't contain as
> much cleanup and explanation of the glibc code as would be ideal.
>
> I plan to follow these up with more cleanup and adding of documentation,
> which should also describe the glibc/kernel synchronization scheme in
> more detail. I believe I have found all the bugs in robust mutexes
> (when including the kernel-side bugs described in this email), but I'll
> make another review pass when preparing this cleanup+documentation
> patch.
>
>
> === Background
>
> A brief reminder of how glibc and the kernel attempt to synchronize with
> each other:
> Each thread maintains a list of robust mutexes that it has acquired, and
> a single pointer to a futex word that it is about to acquire or release
> (called op_pending). The list is a concurrent list (to some extent)
> because userspace can be interrupted by the post-crash cleanup code run
> by the kernel (ie, this is similar to how one would synchronize with a
> signal handler).
>
> To release a mutex without using FUTEX_UNLOCK_PI, userspace basically
> does the following:
> (L1) atomic_store (&op_pending, &mutex->futex, memory_order_relaxed);
> (L2) <"thread-safe" dequeue of mutex->futex from robust mutex list>
> (L3) fw = atomic_exchange(&mutex->futex, 0, memory_order_release);
> (L4) if (fw & FUTEX_WAITERS) futex_wake(&mutex->futex, 1);
> (L5) atomic_store (&op_pending, NULL, memory_order_relaxed);
> (I'm not showing the signal fences required; for simplicity, assume that
> there is a signal fence after each line.)
>
> In case of a robust PI mutex, L3 and L4 are done by the kernel in
> FUTEX_UNLOCK_PI.
>
>
> === Lost wakeups caused by the kernel
>
> If a thread happens to be killed between L3 and L4 and FUTEX_WAITERS is
> set, then userspace will have set the futex word to 0 but not yet have
> run FUTEX_WAKE.
> op_pending still points to the futex word, but handle_futex_death in
> kernel/futex.c only calls futex_wake if the TID bits in the futex word
> are equal to the TID of the killed thread. The result of that is a lost
> wake-up for one of the threads that set or shared FUTEX_WAITERS.
>
> This can be fixed by running futex_wake conservatively, even if the
> value of *op_ending does not match the TID of the killed thread. This
> can lead to spurious wakeups (eg, if the thread was killed between L4
> and L5), but userspace code has to, in the general case, be prepared for
> spurious futex wake-ups anyway if using POSIX mutexes:
> https://lkml.org/lkml/2014/11/27/472
> Also, the same spurious wake-ups can happen under slightly different
> executions without a crash (eg, if FUTEX_WAITERS is set spuriously due
> to timeouts on another thread that set it).
>
>
> === POSIX' mutex destruction requirements are violated
>
> In a nutshell, the mutex destruction requirements mean that there must
> be a single atomic action that releases a mutex; after a mutex is
> released, the thread releasing the mutex must not access the mutex'
> memory anymore (but it is still allowed to run a FUTEX_WAKE on it); see
> https://lkml.org/lkml/2014/11/27/472 for additional background.
>
> Currently, the kernel can potentially modify the mutex' memory until L5
> has executed. This means that getting killed while releasing a robust
> mutex is not guaranteed to be a single atomic action.
>
> For example, assume that Thread1 has acquired the mutex and releases it,
> but is killed right before L5; if the program is built in such a way
> that only Thread2 will acquire the mutex after Thread1 has released it,
> then POSIX allows Thread2 to acquire the mutex, release it, destroy it,
> and reuse its memory for something else (these are the destruction
> requirements). So, Thread2 does that and reuses the memory for another
> object that is not a mutex, which happens to put the same value as the
> TID of the killed thread at the position of the futex word. Then, the
> kernel sees that *op_pending equals the TID, and sets *op_pending to 0,
> thus corrupting the other object.
> (Note that there is no similar problem when acquiring the mutex because
> just the existance of a thread attempting to acquire the mutex disallows
> attempting to concurrently destroy the mutex.)
>
> For FUTEX_UNLOCK_PI, the fix is relatively straightforward: The kernel
> must clear op_pending in the syscall if op_pending is equal to the
> address of the unlocked futex. This ensures that unlock is a single
> atomic action; the store by the kernel is harmless at least for glibc
> (and it would certainly be possible to get the same effect without a
> userspace-visible change).
>
> If the userspace fastpath is used (ie, L3), then there is no simple fix.
> Never using the fastpath but only something like FUTEX_UNLOCK_PI slows
> down lock release, which can be bad for performance.
>
> When using the fastpath, and to be able to solve this in the kernel when
> cleaning up after the thread is killed, the kernel would need to be able
> to detect whether the thread was killed before the critical atomic
> action in L3; the kernel cannot use the value of the futex word to
> determine this because it is not guaranteed to be a futex word of a
> robust mutex anymore. Thus, the instruction pointer of the userspace
> thread would have to be taken into account, and userspace would have to
> enforce code generation in a way that makes it possible for the kernel
> to easily detect whether the mutex release sequence is after L3 or not.
> On the userspace side, the release sequence starting with L1 is not
> exactly trivial, and I'd prefer to not have to maintain assembly for
> that (OTOH, maybe some of the machinery of restartable sequences could
> be reused here). Therefore, that's not an approach I'd prefer.
>
> Alternatively, it should be possible to let threads that want to acquire
> a mutex with a dead owner request kernel-side recovery of this mutex.
> Basically, the mutex-releasing thread would first set the futex word to
> an intermediate state that signals that it intends to release the mutex
> (eg, setting a special bit but preserving the bits for the TID); it
> would then dequeue the mutex from its robust futex list; and then set
> the futex word to 0 to finalize release.
> Threads that see the special bit in the futex word signaling the intent
> to release cannot simply acquire the mutex in userspace. They would
> have to call a new futex operation in the kernel that then either blocks
> until woken by a normal futex_wake, or recovers the mutex when the
> thread that intended to release it was killed in the meantime.
> The advantage of this approach would be that the interface is simpler
> for userspace (eg, no funny interpretation of instruction pointers or
> such). However, the kernel-side synchronization is probably more
> complex, and there are a couple of details to be solved such as avoiding
> an ABA on the TID of the thread that intended to release the mutex.
> If there is interest in this approach, I can work on the details of the
> general synchronization scheme (though I'd like someone else to take
> care of the kernel-side implementation).
>
> Thoughts?
Ping.