Re: [PATCH] futex: avoid race between requeue and wake
From: Davidlohr Bueso
Date: Tue Apr 08 2014 - 22:20:18 EST
Adding Thomas to the thread.
Sorry for the late reply; I was out running errands all day, only to
come home and find this futex jewel in my inbox.
On Tue, 2014-04-08 at 15:30 -0700, Linus Torvalds wrote:
> On Tue, Apr 8, 2014 at 2:02 PM, Jan Stancek <jstancek@xxxxxxxxxx> wrote:
> >
> > I ran reproducer with following change on s390x system, where this
> > can be reproduced usually within seconds:
> >
> > diff --git a/kernel/futex.c b/kernel/futex.c
> > index 67dacaf..9150ffd 100644
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -1095,6 +1095,7 @@ static int unlock_futex_pi(u32 __user *uaddr, u32 uval)
> > static inline void
> > double_lock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> > {
> > + hb_waiters_inc(hb2);
> > if (hb1 <= hb2) {
> > spin_lock(&hb1->lock);
> > if (hb1 < hb2)
> > @@ -1111,6 +1112,7 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
> > spin_unlock(&hb1->lock);
> > if (hb1 != hb2)
> > spin_unlock(&hb2->lock);
> > + hb_waiters_dec(hb2);
> > }
> >
> > /*
> >
> > Reproducer is running without failures over an hour now and
> > made ~1.4 million iterations.
>
> Ok, that's encouraging. That is the smallest patch I could come up
> with, but as mentioned, it's not optimal. We only need it for
> futex_requeue(), but if we do it there we'd have to handle all the
> different error cases (there's only one call to double_lock_hb(), but
> > due to the error cases there's four calls to double_unlock_hb()).
For consistency and mental sanity, I definitely prefer this alternative
to adding back the spin_is_locked() check.
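(Just to make sure we're comparing the same things: that would be
something along the lines of teaching hb_waiters_pending() to also peek
at the lock state. A purely hypothetical sketch of what I'd rather
avoid:

static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
{
#ifdef CONFIG_SMP
        /*
         * Treat a requeuer currently holding hb->lock as a pending
         * waiter, so an unlocked waker cannot skip the bucket.
         */
        return atomic_read(&hb->waiters) || spin_is_locked(&hb->lock);
#else
        return 1;
#endif
}

which is exactly the sort of lock-state peeking we just moved away
from.)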
Linus, from what I see of your approach, by always incrementing and
decrementing the hb->waiters count in futex_requeue() right around the
double_[un]lock_hb() calls, we're basically saying "let's not do this
pending-waiters optimization" for futex requeueing, right?
Which is fine, requeueing isn't that widely used or performance critical
in many cases. But I bring it up because the legitimate accounting is
done in requeue_futex(), which can obviously be bogus since we increment
*after* taking the hb->lock. Just want to make sure we're on the same
page here.
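IOW, keeping it local to the requeue path, I'd expect the end result to
look roughly like the below. This is only an untested sketch of the
idea; I assume your attached patch does the equivalent plus the proper
error handling:

        hb1 = hash_futex(&key1);
        hb2 = hash_futex(&key2);

retry_private:
        hb_waiters_inc(hb2);    /* make the unlocked check in futex_wake() see us */
        double_lock_hb(hb1, hb2);

        /* ... wake/requeue loop ... */

        double_unlock_hb(hb1, hb2);
        hb_waiters_dec(hb2);    /* needed on every path that drops the locks */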
>
> I'm not sure how much we care. The simple patch basically adds two
> (unnecessary) atomics to the futex_wake_op() path. I don't know how
> critical that path is - not as critical as the regular "futex_wake()",
> I'd expect, but I guess pthread_cond_signal() is the main user.
Agreed. Since the issue occurs because we're requeueing *waiters*, let's
keep it inside the requeueing path only. In the case of futex_wake_op()
it doesn't matter, as we don't need to account for them. It's more code,
but that's it. I'd rather add error housekeeping than add more
unnecessary logic to other futex paths.
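To spell out why futex_wake_op() doesn't care: the only thing this
accounting pairs with is the unlocked fast path in futex_wake(), which
from memory (modulo exact labels) looks like:

        hb = hash_futex(&key);

        /* Make sure we really have tasks to wakeup */
        if (!hb_waiters_pending(hb))
                goto out_put_key;

        spin_lock(&hb->lock);

futex_wake_op() takes both bucket locks unconditionally, so it never
consults the count and would only pay for the extra atomics.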
>
> So I'll have to leave this decision to the futex people. But the
> attached slightly more complex patch *may* be the better one.
>
> May I bother you to test this one too? I really think that
> futex_requeue() is the only user that should need this, so doing it
> there rather than in double_[un]lock_hb() should be slightly more
> optimal, but who knows what I've missed. We clearly *all* missed this
> race back when the ordering rules were documented..
Yep, it's quite an obvious thing we overlooked here, and not even
arch-specific... I'm surprised the requeueing path isn't stressed more
often, and while the race window is quite small (I'm still running Jan's
program in a loop and cannot trigger it on my 80-core x86-64 box), it
should have been caught earlier by some program/benchmark.
Thanks,
Davidlohr