Re: [TIP][RFC 6/7] futex: add requeue_pi calls

From: Darren Hart
Date: Thu Mar 05 2009 - 20:42:47 EST


Darren Hart wrote:
Darren Hart wrote:
Darren Hart wrote:
From: Darren Hart <dvhltc@xxxxxxxxxx>

PI futexes must have an owner at all times, so the standard requeue commands
aren't sufficient. The new commands properly manage pi futex ownership by
ensuring a futex with waiters has an owner at all times. Once complete, these
patches will allow glibc to properly handle pi mutexes with pthread_condvars.

The approach taken here is to create two new futex op codes:

FUTEX_WAIT_REQUEUE_PI:
Threads will use this op code to wait on a futex (such as a non-pi waitqueue)
and wake after they have been requeued to a pi futex. Prior to returning to
userspace, they will take this pi futex (and the underlying rt_mutex).
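
For concreteness, here is a minimal sketch of how a userspace waiter (e.g.
glibc's pthread_cond_wait) might invoke this op code. The futex() wrapper
and the op code value below are illustrative assumptions, not part of this
series:

#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Op code added by this series; the value here is illustrative. */
#ifndef FUTEX_WAIT_REQUEUE_PI
#define FUTEX_WAIT_REQUEUE_PI	11
#endif

/* Wait on the non-pi futex at uaddr, expecting to be requeued to the
 * pi futex at uaddr2 and to own it (and its rt_mutex) on return. */
static int futex_wait_requeue_pi(int *uaddr, int val,
				 const struct timespec *timeout,
				 int *uaddr2)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT_REQUEUE_PI, val,
		       timeout, uaddr2, 0);
}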

futex_wait_requeue_pi() is currently the result of a high speed collision
between futex_wait and futex_lock_pi (with the first part of futex_lock_pi
being done by futex_requeue_pi_init() on behalf of the waiter).

FUTEX_REQUEUE_PI:
This call must be used to wake threads waiting with FUTEX_WAIT_REQUEUE_PI,
regardless of how many threads the caller intends to wake or requeue.
pthread_cond_broadcast should call this with nr_wake=1 and nr_requeue=-1 (all).
pthread_cond_signal should call this with nr_wake=1 and nr_requeue=0. The
reason is that both callers need the benefit of the
futex_requeue_pi_init() routine, which prepares the top_waiter (the thread
to be woken) to take possession of the pi futex by setting FUTEX_WAITERS and
preparing the futex_q.pi_state. futex_requeue() also enqueues the top_waiter
on the rt_mutex via rt_mutex_start_proxy_lock(). If pthread_cond_signal used
FUTEX_WAKE, we would have a similar race window in which the caller could
return and release the mutex before the waiters fully wake, potentially
leaving the rt_mutex with waiters but no owner.
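
A matching sketch of the two call sites described above (again, the wrapper
and names are illustrative assumptions; nr_requeue travels in the syscall's
timeout slot, as it does for FUTEX_CMP_REQUEUE):

#include <sys/syscall.h>
#include <unistd.h>

#ifndef FUTEX_REQUEUE_PI
#define FUTEX_REQUEUE_PI	12	/* illustrative value */
#endif

/* Wake nr_wake waiters on uaddr and requeue up to nr_requeue of the
 * rest onto the pi futex at uaddr2. */
static int futex_requeue_pi(int *uaddr, int nr_wake, int nr_requeue,
			    int *uaddr2)
{
	return syscall(SYS_futex, uaddr, FUTEX_REQUEUE_PI, nr_wake,
		       (void *)(long)nr_requeue, uaddr2, 0);
}

/*
 * pthread_cond_signal:    futex_requeue_pi(&cond_futex, 1,  0, &mutex_futex);
 * pthread_cond_broadcast: futex_requeue_pi(&cond_futex, 1, -1, &mutex_futex);
 */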

We hit a failed paging request running the testcase (7/7) in a loop
(it takes only a few minutes at most to hit on my 8-way x86_64 test
machine). It appears to be the result of splitting rt_mutex_slowlock()
across two execution contexts by means of rt_mutex_start_proxy_lock()
and rt_mutex_finish_proxy_lock(). The former calls
task_blocks_on_rt_mutex() on behalf of the waiting task, before that
task is requeued and woken by the requeueing thread. The latter is
executed upon wakeup by the waiting thread, which somehow manages to
call the new __rt_mutex_slowlock() with waiter->task != NULL and still
succeed with try_to_take_rt_mutex(); this leads to corruption of the
plists and an eventual failed paging request. See 7/7 for the rather
crude testcase that causes this. Any tips on where this race might be
occurring are welcome.

<snip>


I've updated my tracing and can show that rt_mutex_start_proxy_lock() is
not setting RT_MUTEX_HAS_WAITERS as it should:

------------[ cut here ]------------
kernel BUG at kernel/rtmutex.c:646!
invalid opcode: 0000 [#1] PREEMPT SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/vendor
Dumping ftrace buffer:
---------------------------------
<...>-3793 1d..3 558351872us : lookup_pi_state: allocating a new pi state
<...>-3793 1d..3 558351876us : lookup_pi_state: initial rt_mutex owner: ffff88023d9486c0
<...>-3793 1...2 558351877us : futex_requeue: futex_lock_pi_atomic returned: 0
<...>-3793 1...2 558351877us : futex_requeue: futex_requeue_pi_init returned: 0
<...>-3793 1...3 558351879us : rt_mutex_start_proxy_lock: task_blocks_on_rt_mutex returned 0
<...>-3793 1...3 558351880us : rt_mutex_start_proxy_lock: lock has waiterflag: 0
<...>-3793 1...1 558351888us : rt_mutex_unlock: unlocking ffff88023b5f6950
<...>-3793 1...1 558351888us : rt_mutex_unlock: lock waiter flag: 0
<...>-3793 1...1 558351889us : rt_mutex_unlock: unlocked ffff88023b5f6950
<...>-3783 0...1 558351893us : __rt_mutex_slowlock: waiter->task is ffff88023c872440
<...>-3783 0...1 558351897us : try_to_take_rt_mutex: assigned rt_mutex (ffff88023b5f6950) owner to current ffff88023c872440
<...>-3783 0...1 558351897us : __rt_mutex_slowlock: got the lock
---------------------------------

I'll start digging into why that's happening, but I wanted to share the trace output.


As it turns out, I missed setting RT_MUTEX_HAS_WAITERS on the rt_mutex in
rt_mutex_start_proxy_lock() - which seems awfully silly in retrospect, but
was a little non-obvious while writing it. I added mark_rt_mutex_waiters()
after the call to task_blocks_on_rt_mutex(), and the test has now completed
more than 400 iterations successfully (before, it would fail within 2
iterations most of the time).

Steven, there are several ways to set RT_MUTEX_HAS_WAITERS - but this
seemed like a reasonable approach, would you agree? Since I'm holding
the wait_lock I don't technically need the atomic cmpxchg and could
probably just set it explicitly - do you have a preference?
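
For reference, the two options side by side - the first is the existing
mark_rt_mutex_waiters() helper from kernel/rtmutex.c; the second is a
hypothetical open-coded variant (the function name is mine) doing the plain
store the question above is asking about:

/* Option 1: the existing helper. It takes no lock, so it loops on
 * cmpxchg to avoid losing a concurrent fastpath owner change. */
static void mark_rt_mutex_waiters(struct rt_mutex *lock)
{
	unsigned long owner, *p = (unsigned long *) &lock->owner;

	do {
		owner = *p;
	} while (cmpxchg(p, owner, owner | RT_MUTEX_HAS_WAITERS) != owner);
}

/* Option 2 (sketch): caller holds lock->wait_lock, as
 * rt_mutex_start_proxy_lock() does, so a plain OR of the bit might
 * suffice - whether that is safe against the lockless unlock fastpath
 * is exactly the question being asked. */
static void set_rt_mutex_waiters_locked(struct rt_mutex *lock)
{
	lock->owner = (struct task_struct *)
		((unsigned long)lock->owner | RT_MUTEX_HAS_WAITERS);
}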


RFC: rt_mutex: add proxy lock routines

From: Darren Hart <dvhltc@xxxxxxxxxx>

This patch is required for the first half of requeue_pi to function. It
basically splits rt_mutex_slowlock() right down the middle, just before the
first call to schedule(). A sketch of how the futex code is expected to pair
the resulting proxy-lock calls appears after the patch.

This patch uses a new futex_q field, rt_waiter, for now. I think
I should be able to use task->pi_blocked_on in a future version of this
patch.

V6: -add mark_rt_mutex_waiters() to rt_mutex_start_proxy_lock() to avoid
the race condition evident in previous versions
V5: -remove EXPORT_SYMBOL_GPL from the new routines
-minor cleanups
V4: -made detect_deadlock a parameter to rt_mutex_enqueue_task
-refactored rt_mutex_slowlock to share code with new functions
-renamed rt_mutex_enqueue_task and rt_mutex_handle_wakeup to
rt_mutex_start_proxy_lock and rt_mutex_finish_proxy_lock, respectively

Signed-off-by: Darren Hart <dvhltc@xxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Sripathi Kodi <sripathik@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: John Stultz <johnstul@xxxxxxxxxx>
Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
---

kernel/rtmutex.c | 193 +++++++++++++++++++++++++++++++++++++----------
kernel/rtmutex_common.h | 8 ++
2 files changed, 161 insertions(+), 40 deletions(-)


diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index 69d9cb9..f438362 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -411,6 +411,7 @@ static int try_to_take_rt_mutex(struct rt_mutex *lock)
*/
static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
struct rt_mutex_waiter *waiter,
+ struct task_struct *task,
int detect_deadlock)
{
struct task_struct *owner = rt_mutex_owner(lock);
@@ -418,21 +419,21 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
unsigned long flags;
int chain_walk = 0, res;

- spin_lock_irqsave(&current->pi_lock, flags);
- __rt_mutex_adjust_prio(current);
- waiter->task = current;
+ spin_lock_irqsave(&task->pi_lock, flags);
+ __rt_mutex_adjust_prio(task);
+ waiter->task = task;
waiter->lock = lock;
- plist_node_init(&waiter->list_entry, current->prio);
- plist_node_init(&waiter->pi_list_entry, current->prio);
+ plist_node_init(&waiter->list_entry, task->prio);
+ plist_node_init(&waiter->pi_list_entry, task->prio);

/* Get the top priority waiter on the lock */
if (rt_mutex_has_waiters(lock))
top_waiter = rt_mutex_top_waiter(lock);
plist_add(&waiter->list_entry, &lock->wait_list);

- current->pi_blocked_on = waiter;
+ task->pi_blocked_on = waiter;

- spin_unlock_irqrestore(&current->pi_lock, flags);
+ spin_unlock_irqrestore(&task->pi_lock, flags);

if (waiter == rt_mutex_top_waiter(lock)) {
spin_lock_irqsave(&owner->pi_lock, flags);
@@ -460,7 +461,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
spin_unlock(&lock->wait_lock);

res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
- current);
+ task);

spin_lock(&lock->wait_lock);

@@ -605,37 +606,25 @@ void rt_mutex_adjust_pi(struct task_struct *task)
rt_mutex_adjust_prio_chain(task, 0, NULL, NULL, task);
}

-/*
- * Slow path lock function:
+/**
+ * __rt_mutex_slowlock - perform the wait-wake-try-to-take loop
+ * @lock: the rt_mutex to take
+ * @state: the state the task should block in (TASK_INTERRUPTIBLE
+ * or TASK_UNINTERRUPTIBLE)
+ * @timeout: the pre-initialized and started timer, or NULL for none
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @detect_deadlock: passed to task_blocks_on_rt_mutex
+ *
+ * lock->wait_lock must be held by the caller.
*/
static int __sched
-rt_mutex_slowlock(struct rt_mutex *lock, int state,
- struct hrtimer_sleeper *timeout,
- int detect_deadlock)
+__rt_mutex_slowlock(struct rt_mutex *lock, int state,
+ struct hrtimer_sleeper *timeout,
+ struct rt_mutex_waiter *waiter,
+ int detect_deadlock)
{
- struct rt_mutex_waiter waiter;
int ret = 0;

- debug_rt_mutex_init_waiter(&waiter);
- waiter.task = NULL;
-
- spin_lock(&lock->wait_lock);
-
- /* Try to acquire the lock again: */
- if (try_to_take_rt_mutex(lock)) {
- spin_unlock(&lock->wait_lock);
- return 0;
- }
-
- set_current_state(state);
-
- /* Setup the timer, when timeout != NULL */
- if (unlikely(timeout)) {
- hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
- if (!hrtimer_active(&timeout->timer))
- timeout->task = NULL;
- }
-
for (;;) {
/* Try to acquire the lock: */
if (try_to_take_rt_mutex(lock))
@@ -656,19 +645,19 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
}

/*
- * waiter.task is NULL the first time we come here and
+ * waiter->task is NULL the first time we come here and
* when we have been woken up by the previous owner
* but the lock got stolen by a higher prio task.
*/
- if (!waiter.task) {
- ret = task_blocks_on_rt_mutex(lock, &waiter,
+ if (!waiter->task) {
+ ret = task_blocks_on_rt_mutex(lock, waiter, current,
detect_deadlock);
/*
* If we got woken up by the owner then start loop
* all over without going into schedule to try
* to get the lock now:
*/
- if (unlikely(!waiter.task)) {
+ if (unlikely(!waiter->task)) {
/*
* Reset the return value. We might
* have returned with -EDEADLK and the
@@ -684,15 +673,52 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,

spin_unlock(&lock->wait_lock);

- debug_rt_mutex_print_deadlock(&waiter);
+ debug_rt_mutex_print_deadlock(waiter);

- if (waiter.task)
+ if (waiter->task)
schedule_rt_mutex(lock);

spin_lock(&lock->wait_lock);
set_current_state(state);
}

+ return ret;
+}
+
+/*
+ * Slow path lock function:
+ */
+static int __sched
+rt_mutex_slowlock(struct rt_mutex *lock, int state,
+ struct hrtimer_sleeper *timeout,
+ int detect_deadlock)
+{
+ struct rt_mutex_waiter waiter;
+ int ret = 0;
+
+ debug_rt_mutex_init_waiter(&waiter);
+ waiter.task = NULL;
+
+ spin_lock(&lock->wait_lock);
+
+ /* Try to acquire the lock again: */
+ if (try_to_take_rt_mutex(lock)) {
+ spin_unlock(&lock->wait_lock);
+ return 0;
+ }
+
+ set_current_state(state);
+
+ /* Setup the timer, when timeout != NULL */
+ if (unlikely(timeout)) {
+ hrtimer_start_expires(&timeout->timer, HRTIMER_MODE_ABS);
+ if (!hrtimer_active(&timeout->timer))
+ timeout->task = NULL;
+ }
+
+ ret = __rt_mutex_slowlock(lock, state, timeout, &waiter,
+ detect_deadlock);
+
set_current_state(TASK_RUNNING);

if (unlikely(waiter.task))
@@ -986,6 +1012,42 @@ void rt_mutex_proxy_unlock(struct rt_mutex *lock,
}

/**
+ * rt_mutex_start_proxy_lock - prepare another task to take the lock
+ *
+ * @lock: the rt_mutex to take
+ * @waiter: the rt_mutex_waiter initialized by the waiter
+ * @task: the task to prepare
+ * @detect_deadlock: passed to task_blocks_on_rt_mutex
+ *
+ * The lock should have an owner, and it should not be task.
+ * Special API call for FUTEX_REQUEUE_PI support.
+ */
+int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *task, int detect_deadlock)
+{
+ int ret;
+
+ spin_lock(&lock->wait_lock);
+ ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);
+ mark_rt_mutex_waiters(lock);
+ if (ret && !waiter->task) {
+ /*
+ * Reset the return value. We might have
+ * returned with -EDEADLK and the owner
+ * released the lock while we were walking the
+ * pi chain. Let the waiter sort it out.
+ */
+ ret = 0;
+ }
+ spin_unlock(&lock->wait_lock);
+
+ debug_rt_mutex_print_deadlock(waiter);
+
+ return ret;
+}
+
+/**
* rt_mutex_next_owner - return the next owner of the lock
*
* @lock: the rt lock query
@@ -1004,3 +1066,54 @@ struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock)

return rt_mutex_top_waiter(lock)->task;
}
+
+/**
+ * rt_mutex_finish_proxy_lock - Complete the taking of the lock initialized on
+ * our behalf by another thread.
+ * @lock: the rt_mutex we were woken on
+ * @to: the timeout, NULL if none. The hrtimer should already have been started.
+ * @waiter: the pre-initialized rt_mutex_waiter
+ * @detect_deadlock: for use by __rt_mutex_slowlock
+ *
+ * Special API call for PI-futex requeue support
+ */
+int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+ struct hrtimer_sleeper *to,
+ struct rt_mutex_waiter *waiter,
+ int detect_deadlock)
+{
+ int ret;
+
+ if (waiter->task)
+ schedule_rt_mutex(lock);
+
+ spin_lock(&lock->wait_lock);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter,
+ detect_deadlock);
+
+ set_current_state(TASK_RUNNING);
+
+ if (unlikely(waiter->task))
+ remove_waiter(lock, waiter);
+
+ /*
+ * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
+ * have to fix that up.
+ */
+ fixup_rt_mutex_waiters(lock);
+
+ spin_unlock(&lock->wait_lock);
+
+ /*
+ * Readjust priority, when we did not get the lock. We might have been
+ * the pending owner and boosted. Since we did not take the lock, the
+ * PI boost has to go.
+ */
+ if (unlikely(ret))
+ rt_mutex_adjust_prio(current);
+
+ return ret;
+}
diff --git a/kernel/rtmutex_common.h b/kernel/rtmutex_common.h
index e124bf5..97a2f81 100644
--- a/kernel/rtmutex_common.h
+++ b/kernel/rtmutex_common.h
@@ -120,6 +120,14 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock,
struct task_struct *proxy_owner);
extern void rt_mutex_proxy_unlock(struct rt_mutex *lock,
struct task_struct *proxy_owner);
+extern int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
+ struct rt_mutex_waiter *waiter,
+ struct task_struct *task,
+ int detect_deadlock);
+extern int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
+ struct hrtimer_sleeper *to,
+ struct rt_mutex_waiter *waiter,
+ int detect_deadlock);

#ifdef CONFIG_DEBUG_RT_MUTEXES
# include "rtmutex-debug.h"
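
For context, here is roughly how the futex side (patch 6/7 of this series,
not shown here) is expected to pair the two new calls. The surrounding names
(pi_mutex, this, to, rt_waiter) are assumptions based on the descriptions
above, not quotes from the actual patch:

/* Requeueing thread (futex_requeue(), FUTEX_REQUEUE_PI): block the
 * top waiter on the pi futex's rt_mutex on its behalf, before it is
 * requeued and woken. */
ret = rt_mutex_start_proxy_lock(pi_mutex, &this->rt_waiter,
				this->task, 0);

/* Woken waiter (futex_wait_requeue_pi()): complete the acquisition
 * the requeueing thread started for us. */
ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 0);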

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team