Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal

From: Darren Hart
Date: Thu Aug 06 2009 - 20:37:16 EST


Steven Rostedt wrote:
On Thu, 6 Aug 2009, Darren Hart wrote:

Peter Zijlstra wrote:
On Wed, 2009-08-05 at 17:01 -0700, Darren Hart wrote:
NOT FOR INCLUSION

Fixup the uval and pi_state owner in futex_requeue(requeue_pi=1) in the
event
of a lock steal or owner died. I had hoped to leave it up to the new
owner to
fix up the userspace value since we can't really handle a fault here
gracefully. This should be safe as the lock is contended and should force
all
userspace attempts to lock or unlock into the kernel where they'll block
on the
hb lock. However, when I don't update the uaddr, I hit the WARN_ON(pid !=
pi_state->owner->pid) as expected, and the userspace testcase deadlocks.

I need to try and better understand what's happening to hang userspace.
In the
meantime I thought I'd share what I'm working with atm. This is a
complete HACK
and is ugly, non-modular, etc etc. However, it currently works. It would
explode
in a most impressive fashion should we happen to fault. So the remaining
questions
are:

o Why does userspace deadlock if we leave the uval updating to the new
owner
waking up in futex_wait_requeue_pi()?

o If we have to handle a fault in futex_requeue(), how can we best cleanup
the
proxy lock acquisition and get things back into a sane state. We
faulted, so
determinism is out the window anyway, we just need to recover
gracefully.

Do you have a trace of the thing going down?
I finally did get a trace... but learned something in the process. Elaborating
below.

I'm assuming you used ftrace as your tracing infrastructure?

:-) Yes, ftrace with the nop tracer and some trace_printk worked nicely. I turned on the trace from userspace and stopped the trace from within the kernel using tracing_off(). But, since the bug didn't actually exist, I never actually hit tracing_off(). The trace_printk's however nicely highlighted the "race window" that wasn't actually a window ;-)

Thanks for the ring buffer fixes btw.

--
Darren


-- Steve

Tglx and me usually use sched_switch and a few trace_printk()s sprinkled
around, the typical one would be in sys_futex, printing the futex cmd
and arg.

OK, so run me through this one more time.

A condvar has two futexes, an inner and an outer. The inner futex is
always locked and the waiting threads are stacked on that.
3 actually:

cond->data->futex (the waitqueue)
cond->data->lock (the lock protecting the internal data)
outer mutex (the pthread_mutex)

Then on signal/broadcast, we lock the outer lock and requeue all the
blocked tasks from the inner to the outer, then we release the outer
lock and let them rip.
Yes - and in requeue_pi with a PI mutex we only let 1 rip, and requeue the
rest, rather than wake them all as the old implementation for PI mutexes did.

Since we're seeing lock steals, I'm thinking the outer lock isn't taken
when we're doing the requeue?
Correct. Unfortunately this is "valid" usage.

Anyway, during the requeue we lock-steal because the owner isn't running
yet and we iterate a higher prio task in the requeue loop?
I believe so.

This leaves the outer lock's futex field messed up because it points to
the wrong TID.
The futex uval isn't messed up, it just still hold the value of the previous
owners tid (not the expected owner we're stealing from). I believe now that
this is proper behavior.

After we finish the requeue loop, we unlock the HBs.


So far so good?
Yup.


Now, normally the waking thread will find itself owner and will check
the futex variable and fix her up -- while holding the HB lock.
Correcto.

However, in case the outer lock gets contended again, we can get
interrupted between requeue and wakeup/fixup and observe this messed up
futex value, which is causing this WARN to trigger.
This is where I was mistaken. I had seen the WARN_ON(pid !=
pi_state->owner->pid) in lookup_pi_state() while working on the previous 2
patches I sent to the list. The one which updates the lock_ptr of the futex_q
to match that of the pi_state should fix this. What happened before was we
would grab the wrong hb lock so while we were fixing up the pi_state and uval
in the woken thread, a contending thread would read those value while holding
the correct hb lock. That race is fixed with the "[PATCH 1/2] Update woken
requeued futex_q lock_ptr" patch.

So where do we deadlock, after this all goes down? Do we perhaps lookup
the wrong pi_state using that wrong TID?

We only deadlocked while I was (wrongly) trying to update pi_state owner from
the requeue thread. Deadlocks don't occur in my testing with only patches 1
and 2.

[PATCH 1/2] Update woken requeued futex_q lock_ptr" patch
[PATCH 2/2][RT] Avoid deadlock in rt_mutex_start_proxy_lock()


Since its only the outer futex's value that matters, right? Can't we pin
that using get_user_pages() before we take the HB lock and go into the
requeue loop? That way we're sure to be able to change it without
faulting.
I now don't believe we have to do this. In fact, futex_lock_pi() exhibits a
similar "race window" (simplified below):

/*
* Block on the PI mutex:
*/
ret = rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1);

[RACE WINDOW ] (not really, see below)

spin_lock(q.lock_ptr);
/*
* Fixup the pi_state owner and possibly acquire the lock if we
* haven't already.
*/
res = fixup_owner(uaddr, fshared, &q, !ret);

Note that the rt_mutex is acquire while the q.lock_ptr (hb->lock) is not held
(since we can sleep). This is FINE and not a race. Lets look at what happens
if another task tries to get the lock during that time:

futex_lock_pi
futex_lock_pi_atomic
lookup_pi_state

At this point we have the pi_state. It's owner field will point to the
previous owner, not the task that is currently acquiring it. But the rt_mutex
itself knows who owns it, so proper boosting should still occur. Once the new
owner complete the pi_state update, the pi_state will be removed from the old
owner pi_state_list and added to its pi_state_list. Since the futex uval
shows it's owned in both cases, the new contender is still forced into the
kernel to block on the rt_mutex. Since we update the uval, then the
pi_state->owner, we are sure to be able to access the rt_mutex via the old
uval so long as we hold the hb->lock.

So, I think we're fine with respect to the pi_state ownership! In fact I
finally managed to catch the lock steal in the requeue loop in my tracing, and
everything worked fine. Going to go rerun a bunch more tests and see if I hit
any other issues, if I do, I suspect they are unrelated to this.

Thanks for the help in thinking this through.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team



--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/