Re: [patch] futex: fix fault handling in futex_lock_pi

From: Peter Zijlstra
Date: Mon Jun 23 2008 - 10:03:52 EST


On Mon, 2008-06-23 at 13:25 +0200, Thomas Gleixner wrote:
> On Mon, 23 Jun 2008, Thomas Gleixner wrote:
>
> Ingo asked about more information about the BUG. Find below the same
> patch with an updated commit log.
>
> Thanks,
> tglx
> ------------------->
> Date: Mon, 23 Jun 2008 11:21:58 +0200
> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Subject: [patch] futex: fix fault handling in futex_lock_pi
>
> This patch addresses a very sporadic pi-futex related failure in
> highly threaded java apps on large SMP systems.
>
> David Holmes reported that the pi_state consistency check in
> lookup_pi_state triggered with his test application. This means that
> the kernel internal pi_state and the user space futex variable are out
> of sync. First we assumed that this is a user space data corruption,
> but deeper investigation revieled that the problem happend because the
> pi-futex code is not handling a fault in the futex_lock_pi path when
> the user space variable needs to be fixed up.
>
> The fault happens when a fork mapped the anon memory which contains
> the futex readonly for COW or the page got swapped out exactly between
> the unlock of the futex and the return of either the new futex owner
> or the task which was the expected owner but failed to acquire the
> kernel internal rtmutex. The current futex_lock_pi() code drops out
> with an inconsistent in case it faults and returns -EFAULT to user
> space. User space has no way to fixup that state.
>
> When we wrote this code we thought that we could not drop the hash
> bucket lock at this point to handle the fault.
>
> After analysing the code again it turned out to be wrong because there
> are only two tasks involved which might modify the pi_state and the
> user space variable:
>
> - the task which acquired the rtmutex
> - the pending owner of the pi_state which did not get the rtmutex
>
> Both tasks drop into the fixup_pi_state() function before returning to
> user space. The first task which acquired the hash bucket lock faults
> in the fixup of the user space variable, drops the spinlock and calls
> futex_handle_fault() to fault in the page. Now the second task could
> acquire the hash bucket lock and tries to fixup the user space
> variable as well. It either faults as well or it succeeds because the
> first task already faulted the page in.
>
> One caveat is to avoid a double fixup. After returning from the fault
> handling we reacquire the hash bucket lock and check whether the
> pi_state owner has been modified already.
>
> Reported-by: David Holmes <david.holmes@xxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxx>
> Cc: David Holmes <david.holmes@xxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

Acked-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>

> Cc: <stable@xxxxxxxxxx>
>
> ---
> kernel/futex.c | 93 ++++++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 73 insertions(+), 20 deletions(-)
>
> Index: linux-2.6/kernel/futex.c
> ===================================================================
> --- linux-2.6.orig/kernel/futex.c
> +++ linux-2.6/kernel/futex.c
> @@ -1096,21 +1096,64 @@ static void unqueue_me_pi(struct futex_q
> * private futexes.
> */
> static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
> - struct task_struct *newowner)
> + struct task_struct *newowner,
> + struct rw_semaphore *fshared)
> {
> u32 newtid = task_pid_vnr(newowner) | FUTEX_WAITERS;
> struct futex_pi_state *pi_state = q->pi_state;
> + struct task_struct *oldowner = pi_state->owner;
> u32 uval, curval, newval;
> - int ret;
> + int ret, attempt = 0;
>
> /* Owner died? */
> + if (!pi_state->owner)
> + newtid |= FUTEX_OWNER_DIED;
> +
> + /*
> + * We are here either because we stole the rtmutex from the
> + * pending owner or we are the pending owner which failed to
> + * get the rtmutex. We have to replace the pending owner TID
> + * in the user space variable. This must be atomic as we have
> + * to preserve the owner died bit here.
> + *
> + * Note: We write the user space value _before_ changing the
> + * pi_state because we can fault here. Imagine swapped out
> + * pages or a fork, which was running right before we acquired
> + * mmap_sem, that marked all the anonymous memory readonly for
> + * cow.
> + *
> + * Modifying pi_state _before_ the user space value would
> + * leave the pi_state in an inconsistent state when we fault
> + * here, because we need to drop the hash bucket lock to
> + * handle the fault. This might be observed in the PID check
> + * in lookup_pi_state.
> + */
> +retry:
> + if (get_futex_value_locked(&uval, uaddr))
> + goto handle_fault;
> +
> + while (1) {
> + newval = (uval & FUTEX_OWNER_DIED) | newtid;
> +
> + curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
> +
> + if (curval == -EFAULT)
> + goto handle_fault;
> + if (curval == uval)
> + break;
> + uval = curval;
> + }
> +
> + /*
> + * We fixed up user space. Now we need to fix the pi_state
> + * itself.
> + */
> if (pi_state->owner != NULL) {
> spin_lock_irq(&pi_state->owner->pi_lock);
> WARN_ON(list_empty(&pi_state->list));
> list_del_init(&pi_state->list);
> spin_unlock_irq(&pi_state->owner->pi_lock);
> - } else
> - newtid |= FUTEX_OWNER_DIED;
> + }
>
> pi_state->owner = newowner;
>
> @@ -1118,26 +1161,35 @@ static int fixup_pi_state_owner(u32 __us
> WARN_ON(!list_empty(&pi_state->list));
> list_add(&pi_state->list, &newowner->pi_state_list);
> spin_unlock_irq(&newowner->pi_lock);
> + return 0;
>
> /*
> - * We own it, so we have to replace the pending owner
> - * TID. This must be atomic as we have preserve the
> - * owner died bit here.
> + * To handle the page fault we need to drop the hash bucket
> + * lock here. That gives the other task (either the pending
> + * owner itself or the task which stole the rtmutex) the
> + * chance to try the fixup of the pi_state. So once we are
> + * back from handling the fault we need to check the pi_state
> + * after reacquiring the hash bucket lock and before trying to
> + * do another fixup. When the fixup has been done already we
> + * simply return.
> */
> - ret = get_futex_value_locked(&uval, uaddr);
> +handle_fault:
> + spin_unlock(q->lock_ptr);
>
> - while (!ret) {
> - newval = (uval & FUTEX_OWNER_DIED) | newtid;
> + ret = futex_handle_fault((unsigned long)uaddr, fshared, attempt++);
>
> - curval = cmpxchg_futex_value_locked(uaddr, uval, newval);
> + spin_lock(q->lock_ptr);
>
> - if (curval == -EFAULT)
> - ret = -EFAULT;
> - if (curval == uval)
> - break;
> - uval = curval;
> - }
> - return ret;
> + /*
> + * Check if someone else fixed it for us:
> + */
> + if (pi_state->owner != oldowner)
> + return 0;
> +
> + if (ret)
> + return ret;
> +
> + goto retry;
> }
>
> /*
> @@ -1507,7 +1559,7 @@ static int futex_lock_pi(u32 __user *uad
> * that case:
> */
> if (q.pi_state->owner != curr)
> - ret = fixup_pi_state_owner(uaddr, &q, curr);
> + ret = fixup_pi_state_owner(uaddr, &q, curr, fshared);
> } else {
> /*
> * Catch the rare case, where the lock was released
> @@ -1539,7 +1591,8 @@ static int futex_lock_pi(u32 __user *uad
> int res;
>
> owner = rt_mutex_owner(&q.pi_state->pi_mutex);
> - res = fixup_pi_state_owner(uaddr, &q, owner);
> + res = fixup_pi_state_owner(uaddr, &q, owner,
> + fshared);
>
> /* propagate -EFAULT, if the fixup failed */
> if (res)
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/