Re: [PATCH 1/4] locking/ww_mutex: Fix a deadlock affecting ww_mutexes

From: Daniel Vetter
Date: Wed Nov 23 2016 - 08:08:56 EST


On Wed, Nov 23, 2016 at 02:00:46PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
> > From: Nicolai Hähnle <Nicolai.Haehnle@xxxxxxx>
> >
> > Fix a race condition involving 4 threads and 2 ww_mutexes as indicated in
> > the following example. Acquire context stamps are ordered like the thread
> > numbers, i.e. thread #1 should back off when it encounters a mutex locked
> > by thread #0 etc.
> >
> > Thread #0 Thread #1 Thread #2 Thread #3
> > --------- --------- --------- ---------
> > lock(ww)
> > success
> > lock(ww')
> > success
> > lock(ww)
> > lock(ww) .
> > . . unlock(ww) part 1
> > lock(ww) . . .
> > success . . .
> > . . unlock(ww) part 2
> > . back off
> > lock(ww') .
> > . .
> > (stuck) (stuck)
> >
> > Here, unlock(ww) part 1 is the part that sets lock->base.count to 1
> > (without being protected by lock->base.wait_lock), meaning that thread #0
> > can acquire ww in the fast path or, much more likely, the medium path
> > in mutex_optimistic_spin. Since lock->base.count == 0, thread #0 then
> > won't wake up any of the waiters in ww_mutex_set_context_fastpath.
> >
> > Then, unlock(ww) part 2 wakes up _only_the_first_ waiter of ww. This is
> > thread #2, since waiters are added at the tail. Thread #2 wakes up and
> > backs off since it sees ww owned by a context with a lower stamp.
> >
> > Meanwhile, thread #1 is never woken up, and so it won't back off its lock
> > on ww'. So thread #0 gets stuck waiting for ww' to be released.
> >
> > This patch fixes the deadlock by waking up all waiters in the slow path
> > of ww_mutex_unlock.
> >
> > We have an internal test case for amdgpu which continuously submits
> > command streams from tens of threads, where all command streams reference
> > hundreds of GPU buffer objects with a lot of overlap in the buffer lists
> > between command streams. This test reliably caused a deadlock, and while I
> > haven't completely confirmed that it is exactly the scenario outlined
> > above, this patch does fix the test case.
> >
> > v2:
> > - use wake_q_add
> > - add additional explanations
> >
> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > Cc: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> > Cc: Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxx>
> > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
> > Cc: stable@xxxxxxxxxxxxxxx
> > Reviewed-by: Christian König <christian.koenig@xxxxxxx> (v1)
> > Signed-off-by: Nicolai Hähnle <nicolai.haehnle@xxxxxxx>
>
> Completely and utterly fails to apply; I think this patch is based on
> code prior to the mutex rewrite.
>
> Please rebase on tip/locking/core.
>
> Also, is this a regression, or has this been a 'feature' of the ww_mutex
> code from early on?

Sorry forgot to mention that, but I checked. Seems to have been broken
since day 1, at least looking at the original code the wake-single-waiter
stuff is as old as the mutex code added in 2006.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch