Re: [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()

From: Steven Rostedt
Date: Wed Mar 15 2023 - 11:10:03 EST


On Wed, 8 Mar 2023 07:17:38 +0100
Christian König <christian.koenig@xxxxxxx> wrote:

> Am 08.03.23 um 03:26 schrieb Steven Rostedt:
> > On Tue, 7 Mar 2023 21:22:23 -0500
> > Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> >> Looks like there was a lock possibly used after free. But as commit
> >> 9bff18d13473a9fdf81d5158248472a9d8ecf2bd ("drm/ttm: use per BO cleanup
> >> workers") changed a lot of this code, I figured it may be the culprit.
> > If I bothered to look at the second warning after this one (I usually stop
> > after the first), it appears to state there was a use after free issue.
>
> Yeah, that looks like the reference count was somehow messed up.
>
> What test case/environment do you run to trigger this?
>
> Thanks for the notice,
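
Christian's guess that the reference count was messed up points at a well-known hazard with per-object cleanup workers. If the queued work item runs after the last reference has been dropped, the worker locks memory that was already freed. Below is a minimal single-threaded C sketch of the usual rule. It assumes a hypothetical `struct obj` with a plain `int` refcount. It is not TTM's code and does not claim to identify the actual bug: queuing the delayed delete takes a reference that only the worker drops.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical model of an object destroyed by a deferred "delayed delete"
 * worker. Single-threaded and non-atomic for clarity; real kernel code
 * would use kref/refcount_t and a workqueue. Not TTM's implementation.
 */
struct obj {
	int refcount;
	bool work_pending;
	int *freed_counter; /* incremented when the object is freed */
};

static struct obj *obj_new(int *freed_counter)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	o->refcount = 1; /* reference owned by the creator */
	o->work_pending = false;
	o->freed_counter = freed_counter;
	return o;
}

/* Drop one reference; free the object when the last one goes away. */
static void obj_put(struct obj *o)
{
	if (--o->refcount == 0) {
		(*o->freed_counter)++;
		free(o);
	}
}

/*
 * Queue the delayed delete. The pending work holds its own reference, so
 * the creator dropping its reference cannot free the object (and any lock
 * embedded in it) before the worker runs.
 */
static void obj_queue_delayed_delete(struct obj *o)
{
	o->refcount++;
	o->work_pending = true;
}

/* Worker body: use the object, then drop the reference the queue took. */
static void obj_run_delayed_delete(struct obj *o)
{
	o->work_pending = false;
	/* ... cleanup that dereferences o, e.g. taking a lock inside it ... */
	obj_put(o);
}
```

Without the `refcount++` in `obj_queue_delayed_delete()`, the creator's `obj_put()` would free the object and the later worker would dereference freed memory, the same class of failure that a poisoned or reused lock reports.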

I'm still getting this on Linus's latest tree (based on v6.3-rc2, per the version string in the log).

[ 230.530222] ------------[ cut here ]------------
[ 230.569795] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 230.569957] WARNING: CPU: 0 PID: 212 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x62a/0x1300
[ 230.612599] Modules linked in:
[ 230.632144] CPU: 0 PID: 212 Comm: kworker/0:8H Not tainted 6.3.0-rc2-test-00047-g6015b1aca1a2-dirty #992
[ 230.654939] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 230.678866] Workqueue: ttm ttm_bo_delayed_delete
[ 230.699452] EIP: __ww_mutex_lock.constprop.0+0x62a/0x1300
[ 230.720582] Code: e8 3b 9a 95 ff 85 c0 0f 84 61 fa ff ff 8b 0d 58 bc 3a c4 85 c9 0f 85 53 fa ff ff 68 54 98 06 c4 68 b7 b6 04 c4 e8 46 af 40 ff <0f> 0b 58 5a e9 3b fa ff ff 8d 74 26 00 90 a1 ec 47 b0 c4 85 c0 75
[ 230.768336] EAX: 00000028 EBX: 00000000 ECX: c51abdd8 EDX: 00000002
[ 230.792001] ESI: 00000000 EDI: c53856bc EBP: c51abf00 ESP: c51abeac
[ 230.815944] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
[ 230.840033] CR0: 80050033 CR2: ff9ff000 CR3: 04506000 CR4: 00150ef0
[ 230.864059] Call Trace:
[ 230.886369] ? ttm_bo_delayed_delete+0x30/0x94
[ 230.909902] ww_mutex_lock+0x32/0x94
[ 230.932550] ttm_bo_delayed_delete+0x30/0x94
[ 230.955798] process_one_work+0x21a/0x484
[ 230.979335] worker_thread+0x14a/0x39c
[ 231.002258] kthread+0xea/0x10c
[ 231.024769] ? process_one_work+0x484/0x484
[ 231.047870] ? kthread_complete_and_exit+0x1c/0x1c
[ 231.071498] ret_from_fork+0x1c/0x28
[ 231.094701] irq event stamp: 4023
[ 231.117272] hardirqs last enabled at (4023): [<c3d1df99>] _raw_spin_unlock_irqrestore+0x2d/0x58
[ 231.143217] hardirqs last disabled at (4022): [<c31d5a55>] kvfree_call_rcu+0x155/0x2ec
[ 231.166058] softirqs last enabled at (3460): [<c3d1f403>] __do_softirq+0x2c3/0x3bb
[ 231.183104] softirqs last disabled at (3455): [<c30c96a9>] call_on_stack+0x45/0x4c
[ 231.200336] ---[ end trace 0000000000000000 ]---
[ 231.216572] ------------[ cut here ]------------
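
With mutex debugging enabled (CONFIG_DEBUG_MUTEXES), a mutex records its own address in `lock->magic` when it is initialized, and the lock path checks `lock->magic != lock`. If the memory holding the lock was freed and then poisoned or reused, that self-pointer no longer matches, which fits the use-after-free reading above. The sketch below models the check in userspace C. `struct dbg_mutex`, `dbg_mutex_sane()` and `poison_like_free()` are hypothetical names, and whether slab poisoning is enabled in the attached config is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Simplified userspace model of a debug mutex, not the kernel's struct
 * mutex. The lock's own address is recorded at init time; it is stored
 * as an integer here so the comparison stays well defined.
 */
struct dbg_mutex {
	uintptr_t magic; /* == (uintptr_t)self once initialized */
	int owner;       /* placeholder for the rest of the lock state */
};

static void dbg_mutex_init(struct dbg_mutex *lock)
{
	lock->magic = (uintptr_t)lock;
	lock->owner = 0;
}

/* Same self-check idea as DEBUG_LOCKS_WARN_ON(lock->magic != lock). */
static bool dbg_mutex_sane(const struct dbg_mutex *lock)
{
	if (lock->magic != (uintptr_t)lock) {
		fprintf(stderr, "DEBUG_LOCKS_WARN_ON(lock->magic != lock)\n");
		return false;
	}
	return true;
}

/*
 * Stand-in for freeing with slab poisoning: overwrite the object with the
 * POISON_FREE byte (0x6b). The memory is still owned by the caller, so
 * reading it afterwards is well defined here, unlike a real use-after-free.
 */
static void poison_like_free(struct dbg_mutex *lock)
{
	memset(lock, 0x6b, sizeof(*lock));
}
```

A lock that was copied to a new location trips the same check, because its `magic` still holds the original address; the check therefore flags both freed and moved locks.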


This bug fails my tests, which prevents me from adding any of my own patches
on top of v6.3-rcX. That means I can't add anything to linux-next until this
is fixed!

-- Steve

Attachment: config-fail
Description: Binary data