Re: [patch 0/2] locking/rtmutex: Cure two subtle bugs
From: Peter Zijlstra
Date: Thu Aug 26 2021 - 09:27:41 EST
On Wed, Aug 25, 2021 at 12:33:11PM +0200, Thomas Gleixner wrote:
> The fixes are straight forward, but the rtmutex based ww_mutex
> implementation still has some more rough egdes which need to be ironed out.
Something like the below (which should be two patches).
There's two issues at hand:
- 'spurious' -EDEADLK reports due to prior loops
- actual deadlock detection that will/should be resolved by ww_mutex
instead.
Both stem from the fact that ww_mutex can legitimately create cycles in
the lock graph. The first issue is because a blocker (P3 blow) can get
trapped in a cycle it didn't cause and hit the iteration depth.
The latter can be observed when we suppose P2 to be the highest priority
and thus should win progress rights over P1, but P2 would also be 'late'
and be the one to close the cycle. In that case PI-walk would report
-EDEADLK to P2, even though w/w would pick and wake P1 to receive
-EDEADLK.
Now; afaict these conditions below capture the above, but then fail to
adequately handle more complex locking chains where, for example, two
w/w classes are inverted, which is a geniune deadlock, which the below
will suppress.
Luckily w/w stuff is in-kernel only and lockdep should capture them; any
user chains should be unaffected.
famous last words etc..
---
kernel/locking/rtmutex.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)
diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
index c8fe74ef8db9..b7145d724b5f 100644
--- a/kernel/locking/rtmutex.c
+++ b/kernel/locking/rtmutex.c
@@ -656,6 +656,31 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
if (next_lock != waiter->lock)
goto out_unlock_pi;
+ /*
+ * There could be 'spurious' loops in the lock graph due to ww_mutex,
+ * consider:
+ *
+ * P1: A, ww_A, ww_B
+ * P2: ww_B, ww_A
+ * P3: A
+ *
+ * P3 should not return -EDEADLK because it gets trapped in the cycle
+ * created by P1 and P2 (which will resolve -- and runs into
+ * max_lock_depth above). Therefore disable detect_deadlock such that
+ * the below termination condition can trigger once all relevant tasks
+ * are boosted.
+ *
+ * Even when we start with ww_mutex we can disable deadlock detection,
+ * since we would supress a ww_mutex induced deadlock at [6] anyway.
+ * Supressing it here however is not sufficient since we might still
+ * hit [6] due to adjustment driven iteration.
+ *
+ * NOTE: if someone were to create a deadlock between 2 ww_classes we'd
+ * utterly fail to report it; lockdep should.
+ */
+ if (waiter->ww_ctx && detect_deadlock)
+ detect_deadlock = false;
+
/*
* Drop out, when the task has no waiters. Note,
* top_waiter can be NULL, when we are in the deboosting
@@ -717,8 +742,21 @@ static int __sched rt_mutex_adjust_prio_chain(struct task_struct *task,
* walk, we detected a deadlock.
*/
if (lock == orig_lock || rt_mutex_owner(lock) == top_task) {
- raw_spin_unlock(&lock->wait_lock);
ret = -EDEADLK;
+
+ /*
+ * When the deadlock is due to ww_mutex; also see above. Don't
+ * report the deadlock and instead let the ww_mutex wound/die
+ * logic pick which of the contending threads gets -EDEADLK.
+ *
+ * NOTE: assumes the cycle only contains a single ww_class; any
+ * other configuration and we fail to report; also, see
+ * lockdep.
+ */
+ if (orig_waiter->ww_ctx)
+ ret = 0;
+
+ raw_spin_unlock(&lock->wait_lock);
goto out_unlock_pi;
}