Re: [RFC][PATCH 4/4] futex: Rewrite FUTEX_UNLOCK_PI
From: Thomas Gleixner
Date: Thu Oct 27 2016 - 16:39:00 EST
On Fri, 21 Oct 2016, Peter Zijlstra wrote:
> On Mon, Oct 10, 2016 at 12:17:48PM +0200, Thomas Gleixner wrote:
> > I fear, we need to rethink this whole locking/protection scheme from
> > scratch.
>
> Here goes... as discussed at ELCE this serializes the {uval,
> pi_state} state using pi_mutex->wait_lock.
>
> I did my best to reason about requeue_pi, but I did suffer some stack
> overflows. That said, I audited all places pi_state is used/modified
> and applied consistent locking and/or comments.
>
> The patch could possibly be broken up in 3 parts, namely:
>
> - introduce the futex specific rt_mutex wrappers that avoid the
> regular rt_mutex fast path (and isolate futex from the normal
> rt_mutex interface).
>
> - add the pi_mutex->wait_lock locking, which at that point will be
> entirely redundant.
>
> - rework the unlock_pi path to drop hb->lock.
>
> Let me know if you would prefer to see that. For review I think having
> all that in a single patch is easier to reason about.
For merging and final review, yes please.
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -914,7 +914,7 @@ void exit_pi_state_list(struct task_stru
> pi_state->owner = NULL;
> raw_spin_unlock_irq(&curr->pi_lock);
>
> - rt_mutex_unlock(&pi_state->pi_mutex);
> + rt_mutex_futex_unlock(&pi_state->pi_mutex);
>
> spin_unlock(&hb->lock);
>
> @@ -963,7 +963,9 @@ void exit_pi_state_list(struct task_stru
> *
> * [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
> *
> - * [8] Owner and user space value match
> + * [8] Owner and user space value match; there is a special case in
> + * wake_futex_pi() where we use pi_state->owner = &init_task to
> + * make this true for TID 0 and avoid rules 9 and 10.
So you create a transient state with TID 0, but without setting the waiter
bit, right? I'll comment on that in the code below.
> *
> * [9] There is no transient state which sets the user space TID to 0
> * except exit_robust_list(), but this is indicated by the
> -static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *top_waiter,
> - struct futex_hash_bucket *hb)
> +static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_pi_state *pi_state)
> {
> - struct task_struct *new_owner;
> - struct futex_pi_state *pi_state = top_waiter->pi_state;
> u32 uninitialized_var(curval), newval;
> + struct task_struct *new_owner;
> + bool deboost = false;
> WAKE_Q(wake_q);
> - bool deboost;
> int ret = 0;
>
> - if (!pi_state)
> - return -EINVAL;
> -
> - /*
> - * If current does not own the pi_state then the futex is
> - * inconsistent and user space fiddled with the futex value.
> - */
> - if (pi_state->owner != current)
> - return -EINVAL;
> -
> raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
> - new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
>
> - /*
> - * It is possible that the next waiter (the one that brought
> - * top_waiter owner to the kernel) timed out and is no longer
> - * waiting on the lock.
> - */
> - if (!new_owner)
> - new_owner = top_waiter->task;
> -
> - /*
> - * We pass it to the next owner. The WAITERS bit is always
> - * kept enabled while there is PI state around. We cleanup the
> - * owner died bit, because we are the owner.
> - */
> - newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
> + new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
> + if (!new_owner) {
> + /*
> + * This is the case where futex_lock_pi() has not yet or failed
> + * to acquire the lock but still has the futex_q enqueued. So
> + * the futex state has a 'waiter' while the rt_mutex state does
> + * not.
> + *
> + * Even though there still is pi_state for this futex, we can
> + * clear FUTEX_WAITERS. Either:
> + *
> + * - we or futex_lock_pi() will drop the last reference and
> + * clean up this pi_state,
> + *
> + * - userspace acquires the futex through its fastpath
> + * and the above pi_state cleanup still happens,
> + *
> + * - or futex_lock_pi() will re-set the WAITERS bit in
> + * fixup_owner().
> + */
> + newval = 0;
> + /*
> + * Since pi_state->owner must point to a valid task, and
> + * task_pid_vnr(pi_state->owner) must match TID_MASK, use
> + * init_task.
> + */
> + new_owner = &init_task;
So that waiter which is now spinning on pi_mutex->lock has already set the
waiters bit, which you undo. So you created the following scenario:
CPU0 CPU1 CPU2
TID 0x1001 TID 0x1000 TID 0x1002
Acquires futex in user space
futex value = 0x1000;
futex_lock_pi()
lock_hb()
set_waiter_bit()
--> futex value = 0x40001000;
futex_unlock_pi()
lock_hb()
attach_to_pi_owner()
rt_mutex_init_proxy_locked()
queue_me()
unlock_hb()
unlock_hb()
rt_mutex_lock() wake_futex_pi()
lock(pi_mutex->lock);
lock(pi_mutex->lock) new_owner is NULL;
--> futex value = 0;
rt_mutex_futex_unlock(pi_mutex);
unlock(pi_mutex->lock);
acquire_rt_mutex() return to user space
Acquires futex in user space
--> futex value = 0x1002
fixup_owner()
fixup_pi_state_owner()
uval = 0x1002;
newval = 0x40001001;
cmpxchg(uval, newval) succeeds
--> futex value = 0x40001001
Voila. Inconsistent state .... TID 0x1002 is borked now.
There is no check in fixup_pi_state_owner() for this case as we until now
rightfully forced consistent state.
So with this new transient state we really need to check whether uval is
what we expect it to be, i.e. 0. The completely untested and probably
incorrect patch below might cure the issue in futex_lock_pi(), but does not
handle any of the requeue_pi mess.
The other option we have is to set the futex value to FUTEX_WAITERS instead
of 0. Have not looked at that variant yet, but it might be simpler as it
forces the other task into the kernel instead of giving it the oportunity
to steal the futex in user space. Just assumptions as brain melt has
already set in.
In any case we must carefully document this new transient state with all
its extra bells and whistels.
That transient state 0 made me nervous from the very beginning and I should
have seen the hole earlier...
Sorry for spoiling the party once again!
Thanks,
tglx
8<------------------------
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2246,16 +2246,27 @@ static int fixup_pi_state_owner(u32 __us
if (get_futex_value_locked(&uval, uaddr))
goto handle_fault;
- while (1) {
- newval = (uval & FUTEX_OWNER_DIED) | newtid;
-
- if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
- goto handle_fault;
- if (curval == uval)
- break;
- uval = curval;
+ /*
+ * If wake_futex_pi() set the futex to 0 and made init_task the
+ * transient owner another task might have acquired the futex
+ * in user space.
+ */
+ if (oldowner == &init_task && uval != 0) {
+ raw_spin_lock(&pi_state->owner->pi_lock);
+ list_del_init(&pi_state->list);
+ raw_spin_unlock(&pi_state->owner->pi_lock);
+ pi_state->owner = NULL;
+ return -EAGAIN;
}
+ newval = (uval & FUTEX_OWNER_DIED) | newtid;
+
+ if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
+ goto handle_fault;
+
+ if (curval != uval)
+ goto retry;
+
/*
* We fixed up user space. Now we need to fix the pi_state
* itself.
@@ -2679,6 +2690,10 @@ static int futex_lock_pi(u32 __user *uad
out_put_key:
put_futex_key(&q.key);
+
+ if (ret == -EAGAIN)
+ goto retry;
+
out:
if (to)
destroy_hrtimer_on_stack(&to->timer);