Futex queue_me/get_user ordering (was: 2.6.10-rc1-mm5 [u])

From: Emergency Services Jamie Lokier
Date: Sun Nov 14 2004 - 04:02:45 EST

Andrew Morton wrote:
> Jamie, if you're around: help!

Revert the patch which moves queue_me(); it's buggy. It is a bug to
move queue_me() after get_user().

It fully explains the blocking threads in Apache and Evolution.

Even if it worked, the patch wouldn't have saved any time, as it's a
rare condition if the caller is using futex properly.

The patch below provides an explanation; I'd appreciate it being


> - According to man of futex:
> "If the futex was not equal to the expected value, the operation
> returns -EWOULDBLOCK."
> but now, here is no description about the rare case:
> "returns 0 if the futex was not equal to the expected value, but
> the process was woken by a FUTEX_WAKE call."
> this behavior on rare case causes the hang which I found.

This case can occur, by design.

Bert, are you still updating the futex man pages? (Or is anyone else?)

If you are, then:

The patch below might provide some text for use in the manual, but
even if you can't easily explain why, the possibility of FUTEX_WAIT
returning 0 and counting as a wakeup when the memory word doesn't
equal val should be mentioned.

I'd appreciate being added to the authors list while you're there,
thanks :)

I think the man page would be a little clearer if the various E
conditions (ETIMEDOUT etc.) were listed in the errors section (even
though they aren't errors). Think about consitency with other man
pages which list EINTR and EAGAIN there. Also, it would be consistent
to say EAGAIN instead of EWOULDBLOCK (they're synonyms in Linux
anyway, but other man pages use EAGAIN as it's the modern name for it).

The phrase "(or other spurious error)" should be removed as it's
actually a kernel bug (but not serious) for that to occur, and
no different to EINTR from other syscalls in that respect.

In the section for FUTEX_WAIT behaviour, you might explain what
"atomically verifies .. and sleeps awaiting FUTEX_WAKE" really means,
perhaps removing the word atomic. It's not really
atomic-test-conditional-sleep, it's just carefully ordered. (Though
it's equivalent to atomic-sleep-test followed by conditional-wake).

The difference is precisely that it may return 0 and count as a wakeup
even when the memory word doesn't match prior to the
effectively-atomic region.


Hidetoshi Seto's example (at the FTP URL with the patch) calls
pthread_cond_signal without mentioning a mutex. That's the wrong way
to use pthread_cond_signal, as explained in the Glibc documentation.

Note that moving queue_me() after get_user() in futex_wake() does NOT
fix Hidetoshi's observed problem.

Just think about the same 4 threads in "[simulation]", but scheduled
in a slightly different sequence. Especially, look at splitting up
the sequence _beteen_ get_user() and queue_me(), and _between_ "wake++
and updated futex val" and "FUTEX_WAKE: no one is in waitqueue / A is
in waitqueue".

The basic logical reason why Hidetoshi's patch doesn't fix anything is
that if the get_user() test is done before queue_me() in the kernel,
that is *exactly the same* as if userspace does the word test itself
just before calling FUTEX_WAIT and FUTEX_WAIT doesn't do any test at all.

In Hidetoshi's pseudo-code, the bug is in pthread_cond_signal: it
should test the return value of FUTEX_WAKE and increment the wake
variable conditionally, not unconditionally as it does. Fix that, and
subsequent signals will wake B. The reason B is not woken initially
is because mutexes aren't used. These aren't futex bugs.


You're right about the double-down_read() problem - I hadn't realised
that could deadlock. That will require a per-task flag to make the
fault handler not take the semaphore when the fault occurs in these
places. But that's a separate bug, not addressed here.


That ->nqueued loop in FUTEX_CMP_REQUEUE is able to return -EAGAIN
even when the memory word does equal the argument - that's quite ugly.

That and the smp_mb() section look dubious. They're a workaround to
simulate doing something inside the spinlocks, but that is different
to the ordering properties that FUTEX_WAIT offers.

I mention this because it's nearly the same problem as prompted this
patch: that FUTEX_WAIT isn't as atomic as some people think it is, and
most importantly, making it more atomic (by using the spinlocks) does
not fix design problems in the caller.

That suggests to me that the callers of FUTEX_CMP_REQUEUE, if they
depend on that ->nqueued / smb_mb() loop, then they may have a race
which will cause problems. If they don't depend on it, then it
shouldn't be there.

In fact that whole primitive does not look very conceptually
convincing. Some kind of requeue-and-test primtive makes sense, but
conceptually, it would make sense to be testing *uaddr2 at the same
time, but it doesn't.


Signed-off-by: Jamie Lokier <jamie@xxxxxxxxxxxxx>

Explain why futex waiters must queue the current thread before testing
the memory word, not after. Consequently, futex_wait() can return 0
and count as a wakeup even if the memory word doesn't match the
value at the start of the syscall.

c.orig 2004-11-03 04:04:50.000000000 +0000
+++ linux-2.6.9/kernel/futex.c 2004-11-14 08:58:27.067607610 +0000
@@ -6,7 +6,7 @@
* (C) Copyright 2003 Red Hat Inc, All Rights Reserved
* Removed page pinning, fix privately mapped COW pages and other cleanups
- * (C) Copyright 2003 Jamie Lokier
+ * (C) Copyright 2003, 2004 Jamie Lokier
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
@@ -489,9 +489,24 @@
queue_me(&q, -1, NULL);

- * Access the page after the futex is queued.
+ * Access the page AFTER the futex is queued.
+ * Order is important:
+ *
+ * Userspace waiter: val = var; if (cond(val)) futex_wait(&var, val);
+ * Userspace waker: if (cond(var)) { var = new; futex_wake(&var); }
+ *
+ * The basic logical guarantee of a futex is that it blocks ONLY
+ * if cond(var) is known to be true at the time of blocking, for
+ * any cond. If we queued after testing *uaddr, that would open
+ * a race condition where we could block indefinitely with
+ * cond(var) false, which would violate the guarantee.
+ *
+ * A consequence is that futex_wait() can return zero and absorb
+ * a wakeup when *uaddr != val on entry to the syscall. This is
+ * rare, but normal.
+ *
* We hold the mmap semaphore, so the mapping cannot have changed
- * since we looked it up.
+ * since we looked it up in get_futex_key.
if (get_user(curval, (int __user *)uaddr) != 0) {
ret = -EFAULT;
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/