Re: [PATCH 1/3] locking/ww_mutex: cleanup lock->ctx usage in amdgpu

From: Christian KÃnig
Date: Tue Feb 20 2018 - 04:44:01 EST


Am 19.02.2018 um 17:43 schrieb Daniel Vetter:
On Mon, Feb 19, 2018 at 05:29:46PM +0100, Christian KÃnig wrote:
[SNIP]
Well that is not a problem at all. See we don't nest trylock with normal
lock acquiring, cause that would indeed bypass the whole deadlock detection.

Instead we first use ww_mutex_acquire to lock all BOs which are needed for a
command submission, including the deadlock detection.

Then all additional BOs which needed to be evicted to fulfill the current
request are trylocked.
Yeah it's the lock; trylock; lock combo that deadlocks. If lockdep indeed
catches that one (and not some other recursion combo) then I think we
don't have to worry about holding tons of trylock'ed locks.

Well I haven't explicitly tested the lock; trylock; lock case, but you get a warning anyway in the lock; ... anything ...; lock case.

See the first and the second lock can't use the same acquire context, because that isn't known down the call stack and lockdep warns about that quite intensively.

What is a problem is that lockdep sometimes runs out of space to keep track of all the trylocked mutexes, but that could have happened before as well if I'm not completely mistaken.

If this is really what you want to do, then we need a
ww_mutex_trylock_ctx, which also fills out the ctx value (so that other
threads can correctly resolve deadlocks when you hold that lock while
trying to grab additional locks). In which case you really don't need the
task pointer.
Actually considered that as well, but it turned out that this is exactly
what I don't want.

Cause then we wouldn't be able to distinct ww_mutex locked with a context
(because they are part of the submission) and without (because TTM trylocked
them).
Out of curiosity, why do you need to know that?

The control flow in TTM is that when you trylocked a BO you start to evict it.

Now sometimes it happens that we evict it from VRAM to GTT, but then find that we don't have enough GTT space and need to evict something from there to the SYSTEM domain.

The problem is now since the reservation object is trylocked because of the VRAM to GTT eviction we can't lock it again because of the GTT to the SYSTEM domain.

Yes it's a disappointment that lockdep doesn't correctly track trylocks,
it just does basic sanity checks, but then drops them on the floor wrt
depency tracking. Just in case you wonder why you're not getting a
lockdeps splat for this. Unfortunately I don't understand lockdep enough
to be able to fix this gap.
Sorry to disappoint you, but lockdep is indeed capable to correctly track
those trylocked BOs.

Got quite a bunch of warning before I was able to resolve to this solution.
Hm, I thought it didn't detect a lock; trylock; lock combo because the
trylock didn't show up in the dependency stack. Maybe that got fixed
meanwhile.

Yeah I can confirm that this indeed got fixed.

Assuming that we indeed need both, could we split up the two use-cases for
clarity? I.e. ww_mutex_is_owned_by_ctx, which only takes the ctx (and
forgoes checking for a task, since that's implied) and requires a non-NULL
ctx. And ww_mutex_is_owned_by_task, which only takes the task (and maybe
we should limit it to trylock and _single locks, i.e. WARN_ON(lock->ctx)
if ww-mutex debugging is enabled).

Or does that hit another requirement of your use-case?

Well we could add two tests, one which only checks the context and one which checks that the context is NULL and then checks the mutex owner.

But to me it actually looks more like that makes it unnecessary complicated. The use case in amdgpu which could only check the context isn't performance critical.

Christian.

-Daniel