Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
From: Peter Zijlstra
Date: Wed Nov 23 2016 - 15:19:34 EST
On Thu, Oct 27, 2016 at 10:36:00PM +0200, Thomas Gleixner wrote:
> > + new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> > + if (!new_owner) {
> > + /*
> > + * This is the case where futex_lock_pi() has not yet or failed
> > + * to acquire the lock but still has the futex_q enqueued. So
> > + * the futex state has a 'waiter' while the rt_mutex state does
> > + * not.
> > + *
> > + * Even though there still is pi_state for this futex, we can
> > + * clear FUTEX_WAITERS. Either:
> > + *
> > + * - we or futex_lock_pi() will drop the last reference and
> > + * clean up this pi_state,
> > + *
> > + * - userspace acquires the futex through its fastpath
> > + * and the above pi_state cleanup still happens,
> > + *
> > + * - or futex_lock_pi() will re-set the WAITERS bit in
> > + * fixup_owner().
> > + */
> > + newval = 0;
> > + /*
> > + * Since pi_state->owner must point to a valid task, and
> > + * task_pid_vnr(pi_state->owner) must match TID_MASK, use
> > + * init_task.
> > + */
> > + new_owner = &init_task;
>
> So that waiter which is now spinning on pi_mutex->lock has already set the
> waiters bit, which you undo. So you created the following scenario:
>
> CPU0 CPU1 CPU2
>
> TID 0x1001 TID 0x1000 TID 0x1002
>
> Acquires futex in user space
> futex value = 0x1000;
>
> futex_lock_pi()
>
> lock_hb()
> set_waiter_bit()
> --> futex value = 0x40001000;
>
> futex_unlock_pi()
> lock_hb()
> attach_to_pi_owner()
> rt_mutex_init_proxy_locked()
> queue_me()
> unlock_hb()
> unlock_hb()
> rt_mutex_lock() wake_futex_pi()
> lock(pi_mutex->lock);
> lock(pi_mutex->lock) new_owner is NULL;
> --> futex value = 0;
> rt_mutex_futex_unlock(pi_mutex);
> unlock(pi_mutex->lock);
> acquire_rt_mutex() return to user space
> Acquires futex in user space
> --> futex value = 0x1002
> fixup_owner()
> fixup_pi_state_owner()
> uval = 0x1002;
> newval = 0x40001001;
> cmpxchg(uval, newval) succeeds
> --> futex value = 0x40001001
>
> Voila. Inconsistent state .... TID 0x1002 is borked now.
Urgh, right.
> The other option we have is to set the futex value to FUTEX_WAITERS instead
> of 0.
Yeah, I initially did that but didn't really like it, I then went on to
convince myself setting it to 0 was good, silly me. Leaking the WAITERS
bit is _by_far_ the simplest option though.
Ok, I went and implemented various broken and discarded alternatives
while re-learning all about futexes that I forgot the past few weeks,
while trying to figure out wtf the below does.
I also tried to create some 4+ CPU races that would hit holes in the
below, no luck so far.
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2246,16 +2246,27 @@ static int fixup_pi_state_owner(u32 __us
> if (get_futex_value_locked(&uval, uaddr))
> goto handle_fault;
>
> + /*
> + * If wake_futex_pi() set the futex to 0 and made init_task the
> + * transient owner another task might have acquired the futex
> + * in user space.
> + */
True, but that doesn't explain why we do this. Naively leaving
pi_state->owner set while returning EAGAIN shouldn't be a problem
because put_pi_state() should clean that up.
_However_, that does rt_mutex_proxy_unlock() on it as well, and _that_
is a problem, because it's not the rt_mutex owner.
Then again, we can hit this very problem through any of the other
put_pi_state() calls after setting ->owner = &init_task I think. Which
would argue to special case this in put_pi_state() instead.
> + if (oldowner == &init_task && uval != 0) {
> + raw_spin_lock(&pi_state->owner->pi_lock);
> + list_del_init(&pi_state->list);
> + raw_spin_unlock(&pi_state->owner->pi_lock);
> + pi_state->owner = NULL;
> + return -EAGAIN;
> }
>
> + newval = (uval & FUTEX_OWNER_DIED) | newtid;
> +
> + if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
> + goto handle_fault;
> +
> + if (curval != uval)
> + goto retry;
This is slightly weird in that we loose the obvious cmpxchg loop
construct. So I'd write it differently, also that
get_futex_value_locked() call is entirely superfluous the second time
around, we got curval after all.
> +
> /*
> * We fixed up user space. Now we need to fix the pi_state
> * itself.
> @@ -2679,6 +2690,10 @@ static int futex_lock_pi(u32 __user *uad
>
> out_put_key:
> put_futex_key(&q.key);
> +
> + if (ret == -EAGAIN)
> + goto retry;
> +
And this is far too clever and really needs a comment. So the crucial
point is that this is after unqueue_me_pi(), which drops the pi_state
and loops back to lookup the pi_state again, which, hopefully, has now
been completely destroyed -- and therefore we hit the regular
attach_to_pi_owner() path, fixing up our 'funny' state.
This is where I was playing with 4+ CPU scenarios, to see if we could
somehow keep the pi_state alive and not make progress.
I think we can do the retry slightly earlier, right after
unqueue_me_pi() and then add retry_queue: right before
futex_lock_pi_atomic(), that would avoid dropping the hb->lock (and
avoids put/get_futex_key).
> out:
> if (to)
> destroy_hrtimer_on_stack(&to->timer);
Also, since fixup_pi_state_owner() can now return -EAGAIN, all users
need handling for that.
I'll try and do a patch that does all that and attempt to write coherent
comments on our fancy new state.