Re: [RFC][PATCH] fixup pi_state in futex_requeue on lock steal

From: Steven Rostedt
Date: Thu Aug 06 2009 - 18:53:40 EST



On Thu, 6 Aug 2009, Darren Hart wrote:

> Peter Zijlstra wrote:
> > On Wed, 2009-08-05 at 17:01 -0700, Darren Hart wrote:
> > > NOT FOR INCLUSION
> > >
> > > Fixup the uval and pi_state owner in futex_requeue(requeue_pi=1) in the
> > > event
> > > of a lock steal or owner died. I had hoped to leave it up to the new
> > > owner to
> > > fix up the userspace value since we can't really handle a fault here
> > > gracefully. This should be safe as the lock is contended and should force
> > > all
> > > userspace attempts to lock or unlock into the kernel where they'll block
> > > on the
> > > hb lock. However, when I don't update the uaddr, I hit the WARN_ON(pid !=
> > > pi_state->owner->pid) as expected, and the userspace testcase deadlocks.
> > >
> > > I need to try and better understand what's happening to hang userspace.
> > > In the
> > > meantime I thought I'd share what I'm working with atm. This is a
> > > complete HACK
> > > and is ugly, non-modular, etc etc. However, it currently works. It would
> > > explode
> > > in a most impressive fashion should we happen to fault. So the remaining
> > > questions
> > > are:
> > >
> > > o Why does userspace deadlock if we leave the uval updating to the new
> > > owner
> > > waking up in futex_wait_requeue_pi()?
> > >
> > > o If we have to handle a fault in futex_requeue(), how can we best cleanup
> > > the
> > > proxy lock acquisition and get things back into a sane state. We
> > > faulted, so
> > > determinism is out the window anyway, we just need to recover
> > > gracefully.
> >
> >
> > Do you have a trace of the thing going down?
>
> I finally did get a trace... but learned something in the process. Elaborating
> below.

I'm assuming you used ftrace as your tracing infrastructure?

-- Steve

>
> >
> > Tglx and me usually use sched_switch and a few trace_printk()s sprinkled
> > around, the typical one would be in sys_futex, printing the futex cmd
> > and arg.
> >
> > OK, so run me through this one more time.
> >
> > A condvar has two futexes, an inner and an outer. The inner futex is
> > always locked and the waiting threads are stacked on that.
>
> 3 actually:
>
> cond->data->futex (the waitqueue)
> cond->data->lock (the lock protecting the internal data)
> outer mutex (the pthread_mutex)
>
> >
> > Then on signal/broadcast, we lock the outer lock and requeue all the
> > blocked tasks from the inner to the outer, then we release the outer
> > lock and let them rip.
>
> Yes - and in requeue_pi with a PI mutex we only let 1 rip, and requeue the
> rest, rather than wake them all as the old implementation for PI mutexes did.
>
> >
> > Since we're seeing lock steals, I'm thinking the outer lock isn't taken
> > when we're doing the requeue?
>
> Correct. Unfortunately this is "valid" usage.
>
> >
> > Anyway, during the requeue we lock-steal because the owner isn't running
> > yet and we iterate a higher prio task in the requeue loop?
>
> I believe so.
>
> >
> > This leaves the outer lock's futex field messed up because it points to
> > the wrong TID.
>
> The futex uval isn't messed up, it just still hold the value of the previous
> owners tid (not the expected owner we're stealing from). I believe now that
> this is proper behavior.
>
> >
> > After we finish the requeue loop, we unlock the HBs.
> >
> >
> > So far so good?
>
> Yup.
>
> >
> >
> > Now, normally the waking thread will find itself owner and will check
> > the futex variable and fix her up -- while holding the HB lock.
>
> Correcto.
>
> >
> > However, in case the outer lock gets contended again, we can get
> > interrupted between requeue and wakeup/fixup and observe this messed up
> > futex value, which is causing this WARN to trigger.
>
> This is where I was mistaken. I had seen the WARN_ON(pid !=
> pi_state->owner->pid) in lookup_pi_state() while working on the previous 2
> patches I sent to the list. The one which updates the lock_ptr of the futex_q
> to match that of the pi_state should fix this. What happened before was we
> would grab the wrong hb lock so while we were fixing up the pi_state and uval
> in the woken thread, a contending thread would read those value while holding
> the correct hb lock. That race is fixed with the "[PATCH 1/2] Update woken
> requeued futex_q lock_ptr" patch.
>
> >
> > So where do we deadlock, after this all goes down? Do we perhaps lookup
> > the wrong pi_state using that wrong TID?
> >
>
> We only deadlocked while I was (wrongly) trying to update pi_state owner from
> the requeue thread. Deadlocks don't occur in my testing with only patches 1
> and 2.
>
> [PATCH 1/2] Update woken requeued futex_q lock_ptr" patch
> [PATCH 2/2][RT] Avoid deadlock in rt_mutex_start_proxy_lock()
>
>
> > Since its only the outer futex's value that matters, right? Can't we pin
> > that using get_user_pages() before we take the HB lock and go into the
> > requeue loop? That way we're sure to be able to change it without
> > faulting.
>
> I now don't believe we have to do this. In fact, futex_lock_pi() exhibits a
> similar "race window" (simplified below):
>
> /*
> * Block on the PI mutex:
> */
> ret = rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1);
>
> [RACE WINDOW ] (not really, see below)
>
> spin_lock(q.lock_ptr);
> /*
> * Fixup the pi_state owner and possibly acquire the lock if we
> * haven't already.
> */
> res = fixup_owner(uaddr, fshared, &q, !ret);
>
> Note that the rt_mutex is acquire while the q.lock_ptr (hb->lock) is not held
> (since we can sleep). This is FINE and not a race. Lets look at what happens
> if another task tries to get the lock during that time:
>
> futex_lock_pi
> futex_lock_pi_atomic
> lookup_pi_state
>
> At this point we have the pi_state. It's owner field will point to the
> previous owner, not the task that is currently acquiring it. But the rt_mutex
> itself knows who owns it, so proper boosting should still occur. Once the new
> owner complete the pi_state update, the pi_state will be removed from the old
> owner pi_state_list and added to its pi_state_list. Since the futex uval
> shows it's owned in both cases, the new contender is still forced into the
> kernel to block on the rt_mutex. Since we update the uval, then the
> pi_state->owner, we are sure to be able to access the rt_mutex via the old
> uval so long as we hold the hb->lock.
>
> So, I think we're fine with respect to the pi_state ownership! In fact I
> finally managed to catch the lock steal in the requeue loop in my tracing, and
> everything worked fine. Going to go rerun a bunch more tests and see if I hit
> any other issues, if I do, I suspect they are unrelated to this.
>
> Thanks for the help in thinking this through.
>
> --
> Darren Hart
> IBM Linux Technology Center
> Real-Time Linux Team
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/