Re: [PATCH] futex: avoid race between requeue and wake

From: Jan Stancek
Date: Tue Apr 08 2014 - 14:01:16 EST

----- Original Message -----
> From: "Linus Torvalds" <torvalds@xxxxxxxxxxxxxxxxxxxx>
> To: "Jan Stancek" <jstancek@xxxxxxxxxx>
> Cc: "Linux Kernel Mailing List" <linux-kernel@xxxxxxxxxxxxxxx>, "Srikar Dronamraju" <srikar@xxxxxxxxxxxxxxxxxx>,
> "Davidlohr Bueso" <davidlohr@xxxxxx>, "Ingo Molnar" <mingo@xxxxxxxxxx>, "Larry Woodman" <lwoodman@xxxxxxxxxx>
> Sent: Tuesday, 8 April, 2014 6:20:42 PM
> Subject: Re: [PATCH] futex: avoid race between requeue and wake
>
> Davidlohr, comments?
>
> On Tue, Apr 8, 2014 at 1:47 AM, Jan Stancek <jstancek@xxxxxxxxxx> wrote:
> > pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
> > occasionally fails, because some threads fail to wake up.
>
> Jan, I _assume_ this is on x86(-64), but can you please confirm?

I was able to reproduce this on x86-64 and s390x. I mostly used an
s390x system with 2 CPUs, because I could reproduce it there a lot
faster (usually within seconds).

My reproducer is a modified pthread_cond_broadcast/4-1.c testcase,
with some sleeps removed or shortened:
http://jan.stancek.eu/tmp/futex_race/pcb.c
gcc pcb.c -o pcb -lpthread
I run it in a loop until it exits with a non-zero code.
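For example, with a shell loop like this (any equivalent loop will do):

i=0
while ./pcb; do
        i=$((i+1))
done
echo "pcb failed after $i successful runs"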

I also experimented with this patch when trying to make the race
more visible (so I could capture some strace logs):

diff --git a/kernel/futex.c b/kernel/futex.c
index 5ff12f5..290a67f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -47,6 +47,7 @@
 #include <linux/slab.h>
 #include <linux/poll.h>
 #include <linux/fs.h>
+#include <linux/delay.h>
 #include <linux/file.h>
 #include <linux/jhash.h>
 #include <linux/init.h>
@@ -1554,6 +1555,7 @@ retry_private:
 		 */
 		if (++task_count <= nr_wake && !requeue_pi) {
 			wake_futex(this);
+			udelay(10000);
 			continue;
 		}

>
> Because if it's on anything else, the whole situation changes.
>
> > Taking hb->lock in this situation will ensure that thread A has to wait
> > in futex_wake() until the main thread finishes the requeue operation.
>
> So the argument was that doing *both* spin_is_locked() and
> atomic_read(&hb->waiters) _should_ be unnecessary, because
> hb_waiters_inc() is done *before* getting the spinlock.
>
> However, one exception to this is "requeue_futex()". Which is in fact
> the test-case that Jan points to. There, when we move a futex from one
> hash bucket to another, we do the increment inside the spinlock.
>
> So I think the change is correct, although the commit message might
> need a bit of improvement. I also hate that "if/else" thing, since
> there's no point in an "else" if the if-statement did a "return". So
> either make it just
>
> 	if (spin_is_locked(&hb->lock))
> 		return 1;
> 	return atomic_read(&hb->waiters);
>
> or (perhaps preferably) make it
>
> 	return spin_is_locked(&hb->lock) || atomic_read(&hb->waiters);
>
> but that "if/else" just makes me go "why?".

Understood, I can submit v2 if Davidlohr is also in favor.
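
As a sketch, v2 would then read something like this, using the combined
form suggested above (the #ifdef CONFIG_SMP structure follows the
existing hb_waiters_pending()):

static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
	/*
	 * A locked bucket can mean requeue_futex() is moving waiters
	 * here and has not yet incremented hb->waiters, so treat a
	 * held lock as "waiters pending".
	 */
	return spin_is_locked(&hb->lock) || atomic_read(&hb->waiters);
#else
	return 1;
#endif
}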
If not, I'm happy to test a different patch proposal with my setup.
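
For the double_lock_hb() idea raised below, I imagine something like the
following sketch, where both buckets are marked busy before their locks
are taken; the error paths in futex_requeue() would then need matching
hb_waiters_dec() calls, presumably via a double_unlock_hb() change:

static inline void
double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
{
	/*
	 * Sketch only: keep hb_waiters_pending() non-zero on both
	 * buckets for the whole requeue operation, following the
	 * "increment before taking the lock" rule.
	 */
	hb_waiters_inc(hb1);
	hb_waiters_inc(hb2);

	if (hb1 <= hb2) {
		spin_lock(&hb1->lock);
		if (hb1 < hb2)
			spin_lock_nested(&hb2->lock, SINGLE_DEPTH_NESTING);
	} else { /* hb1 > hb2 */
		spin_lock(&hb2->lock);
		spin_lock_nested(&hb1->lock, SINGLE_DEPTH_NESTING);
	}
}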

Regards,
Jan

>
> But I'd also like to have Davidlohr look at this, because I have a few
> questions:
>
> - how did this never show up in the original loads? No requeueing
> under those test-loads?
>
> - would we be better off incrementing the waiter count at the top of
> futex_requeue(), at the retry_private label?
>
> That would make us follow the "has to be incremented before taking the
> lock" rule, but at the expense of making the error case handling more
> complex. Although maybe we could do it as part of
> "double_lock/unlock_hb()" and just mark both hb1/hb2 busy?
>
> Linus
>