Re: [PATCH bpf-next v1 11/22] rqspinlock: Add deadlock detection and recovery

From: Waiman Long
Date: Wed Jan 08 2025 - 19:32:34 EST


On 1/8/25 3:19 PM, Kumar Kartikeya Dwivedi wrote:
On Wed, 8 Jan 2025 at 21:36, Waiman Long <llong@xxxxxxxxxx> wrote:

On 1/7/25 8:59 AM, Kumar Kartikeya Dwivedi wrote:
While the timeout logic provides guarantees for the waiter's forward
progress, the time until a stalling waiter unblocks can still be long.
The default timeout of 1/2 sec can be excessively long for some use
cases. Additionally, custom timeouts may exacerbate recovery time.

Introduce logic to detect common cases of deadlocks and perform quicker
recovery. This is done by dividing the time from entry into the locking
slow path until the timeout into intervals of 1 ms. Then, after each
interval elapses, deadlock detection is performed, while also polling
the lock word to ensure we can quickly break out of the detection logic
and proceed with lock acquisition.

A 'held_locks' table is maintained per-CPU where the entry at the bottom
denotes a lock being waited for or already taken. Entries coming before
it denote locks that are already held. The current CPU's table can thus
be looked at to detect AA deadlocks. The tables from other CPUs can be
looked at to discover ABBA situations. Finally, when a matching entry
for the lock being taken on the current CPU is found on some other CPU,
a deadlock situation is detected. Detection can take a long time,
therefore the lock word is polled on each loop iteration (via the
is_lock_released check) so that we can break out of detection and
proceed with lock acquisition as soon as the lock becomes free.

We set the 'spin' member of the rqspinlock_timeout struct to 0 to
trigger deadlock checks immediately and perform faster recovery.

Note: Extending the lock word by 4 bytes to record the owner CPU would
allow faster detection of ABBA, since it is typically the owner that
participates in an ABBA situation. However, to keep compatibility with
existing lock words in the kernel (struct qspinlock), and given that
deadlocks are a rare event triggered by bugs, we choose to favor
compatibility over faster detection.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
---
include/asm-generic/rqspinlock.h | 56 +++++++++-
kernel/locking/rqspinlock.c | 178 ++++++++++++++++++++++++++++---
2 files changed, 220 insertions(+), 14 deletions(-)

diff --git a/include/asm-generic/rqspinlock.h b/include/asm-generic/rqspinlock.h
index 5c996a82e75f..c7e33ccc57a6 100644
--- a/include/asm-generic/rqspinlock.h
+++ b/include/asm-generic/rqspinlock.h
@@ -11,14 +11,68 @@

#include <linux/types.h>
#include <vdso/time64.h>
+#include <linux/percpu.h>

struct qspinlock;

+extern int resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val, u64 timeout);
+
/*
* Default timeout for waiting loops is 0.5 seconds
*/
#define RES_DEF_TIMEOUT (NSEC_PER_SEC / 2)

-extern int resilient_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val, u64 timeout);
+#define RES_NR_HELD 32
+
+struct rqspinlock_held {
+ int cnt;
+ void *locks[RES_NR_HELD];
+};
+
+DECLARE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
+
+static __always_inline void grab_held_lock_entry(void *lock)
+{
+ int cnt = this_cpu_inc_return(rqspinlock_held_locks.cnt);
+
+ if (unlikely(cnt > RES_NR_HELD)) {
+ /* Still keep the inc so we decrement later. */
+ return;
+ }
+
+ /*
+ * There is an implied compiler barrier in per-CPU operations; without it,
+ * the compiler could reorder the increment with the write to the table,
+ * allowing an interrupt to overwrite and erase our entry (on interrupt
+ * exit it is reset to NULL).
+ */
+ this_cpu_write(rqspinlock_held_locks.locks[cnt - 1], lock);
+}
+
+/*
+ * It is possible to run into misdetection scenarios of AA deadlocks on the
+ * same CPU, and missed ABBA deadlocks on remote CPUs, when this function pops
+ * entries out of order (due to a lock A, lock B, unlock A, unlock B pattern).
+ * The correct logic to preserve the right entries in the table would be to
+ * walk the array of held locks and swap and clear out-of-order entries, but
+ * that is too complicated, and we don't have a compelling use case for
+ * out-of-order unlocking.
Maybe we can pass in the lock and print a warning if an out-of-order
unlock is being done.
I think, alternatively, I will constrain the verifier in v2 to require
lock releases to be in order, which would obviate the need to warn at
runtime by rejecting programs that potentially do out-of-order unlocks.
This doesn't cover in-kernel users, but we're not doing out-of-order
unlocks with this lock there, and it would be yet another branch in the
unlock path with little benefit.

That will work too.

Cheers,
Longman