[BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

From: Steven Rostedt
Date: Tue Mar 07 2023 - 21:22:34 EST



In a report for a regression in my code, I tried to run v6.3-rc1 through my
tests. It crashed at boot up on my first test (my start up tests do take a
long time, hence the 206 seconds of boot!).

[ 206.238782] ------------[ cut here ]------------
[ 206.277786] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 206.277946] WARNING: CPU: 0 PID: 332 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x566/0xfec
[ 206.313338] Modules linked in:
[ 206.324732] CPU: 0 PID: 332 Comm: kworker/0:13H Not tainted 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #965
[ 206.338273] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 206.353596] Workqueue: ttm ttm_bo_delayed_delete
[ 206.370520] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
[ 206.382855] Code: e8 ab 59 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 58 c0 3b cf 85 c9 0f 85 17 fb ff ff 68 e0 8d 07 cf 68 2b ac 05 cf e8 e6 e6 3f ff <0f> 0b 58 5a e9 ff fa ff ff e8 78 59 95 ff 85 c0 74 0e 8b 0d 58 c0
[ 206.411247] EAX: 00000028 EBX: 00000000 ECX: c3ae5dd8 EDX: 00000002
[ 206.425193] ESI: 00000000 EDI: c2d5f0bc EBP: c3ae5f00 ESP: c3ae5eac
[ 206.439236] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
[ 206.453597] CR0: 80050033 CR2: ff9ff000 CR3: 0f512000 CR4: 00150ef0
[ 206.467841] Call Trace:
[ 206.481059] ? ttm_bo_delayed_delete+0x30/0x94
[ 206.494980] ww_mutex_lock+0x32/0x94
[ 206.508699] ttm_bo_delayed_delete+0x30/0x94
[ 206.522371] process_one_work+0x21a/0x538
[ 206.536306] worker_thread+0x146/0x398
[ 206.549860] kthread+0xea/0x10c
[ 206.563141] ? process_one_work+0x538/0x538
[ 206.576835] ? kthread_complete_and_exit+0x1c/0x1c
[ 206.590652] ret_from_fork+0x1c/0x28
[ 206.604522] irq event stamp: 4219
[ 206.617852] hardirqs last enabled at (4219): [<ced2a039>] _raw_spin_unlock_irqrestore+0x2d/0x58
[ 206.633077] hardirqs last disabled at (4218): [<ce1d3a65>] kvfree_call_rcu+0x155/0x2ec
[ 206.648161] softirqs last enabled at (3570): [<ced2b113>] __do_softirq+0x2f3/0x48b
[ 206.663025] softirqs last disabled at (3565): [<ce0c84e9>] call_on_stack+0x45/0x4c
[ 206.678065] ---[ end trace 0000000000000000 ]---

Looks like there was a lock possibly used after free. But as commit
9bff18d13473a9fdf81d5158248472a9d8ecf2bd ("drm/ttm: use per BO cleanup
workers") changed a lot of this code, I figured it may be the culprit.

-- Steve