Re: [Intel-gfx] [BUG 6.3-rc1] Bad lock in ttm_bo_delayed_delete()
From: Matthew Auld
Date: Wed Mar 15 2023 - 15:15:55 EST
On Wed, 15 Mar 2023 at 18:41, Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> Am 08.03.23 um 13:43 schrieb Steven Rostedt:
> > On Wed, 8 Mar 2023 07:17:38 +0100
> > Christian König <christian.koenig@xxxxxxx> wrote:
> >
> >> What test case/environment do you run to trigger this?
> > I'm running a 32bit x86 qemu instance. Attached is the config.
> >
> > The libvirt xml file is here: https://rostedt.org/vm-images/tracetest-32.xml
> > and the VM image itself is here: https://rostedt.org/vm-images/tracetest-32.qcow2.bz2
>
> I've started to download that, but it will take about an hour. So I
> tried to avoid that for now.
>
> But looks like there isn't any other way to reproduce this, the code
> seems to work with both amdgpu and radeon.
>
> My suspicion is that we just have a reference count issue in qxl or ttm
> which was never noticed because it didn't caused any problems (except
> for a minor memory corruption).
Why does ttm_bo_cleanup_refs() do a bo_put() at the end? It doesn't
make sense to me. Say if the BO is in the process of being delay freed
(bo->deleted = true), and we just did the kref_init() in
ttm_bo_release(), it might drop that ref hitting ttm_bo_release() yet
again, this time doing the actual bo->destroy(), which frees the
object. The worker then fires at some later point calling
ttm_bo_delayed_delete(), but the BO has already been freed.
>
> Now you get a rain of warnings because we try to grab the lock in the
> delete worker.
>
> Christian.
>
> >
> > It happened again in another test (it's not 100% reproducible).
> >
> > [ 23.234838] ------------[ cut here ]------------
> > [ 23.236391] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
> > [ 23.236429] WARNING: CPU: 0 PID: 61 at kernel/locking/mutex.c:582 __ww_mutex_lock.constprop.0+0x566/0xfec
> > [ 23.240990] Modules linked in:
> > [ 23.242368] CPU: 0 PID: 61 Comm: kworker/0:1H Not tainted 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
> > [ 23.245106] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> > [ 23.247900] Workqueue: ttm ttm_bo_delayed_delete
> > [ 23.249642] EIP: __ww_mutex_lock.constprop.0+0x566/0xfec
> > [ 23.251563] Code: e8 2b 5a 95 ff 85 c0 0f 84 25 fb ff ff 8b 0d 18 71 3b c8 85 c9 0f 85 17 fb ff ff 68 c0 58 07 c8 68 07 77 05 c8 e8 e6 0a 40 ff <0f> 0b 58 5a e9 ff fa ff ff e8 f8 59 95 ff 85 c0 74 0e 8b 0d 18 71
> > [ 23.256901] EAX: 00000028 EBX: 00000000 ECX: c1847dd8 EDX: 00000002
> > [ 23.258849] ESI: 00000000 EDI: c12958bc EBP: c1847f00 ESP: c1847eac
> > [ 23.260786] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> > [ 23.262840] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> > [ 23.264781] Call Trace:
> > [ 23.265899] ? lock_is_held_type+0xbe/0x10c
> > [ 23.267434] ? ttm_bo_delayed_delete+0x30/0x94
> > [ 23.268971] ww_mutex_lock+0x32/0x94
> > [ 23.270327] ttm_bo_delayed_delete+0x30/0x94
> > [ 23.271818] process_one_work+0x21a/0x538
> > [ 23.273242] worker_thread+0x146/0x398
> > [ 23.274616] kthread+0xea/0x10c
> > [ 23.275859] ? process_one_work+0x538/0x538
> > [ 23.277312] ? kthread_complete_and_exit+0x1c/0x1c
> > [ 23.278899] ret_from_fork+0x1c/0x28
> > [ 23.280223] irq event stamp: 33
> > [ 23.281440] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
> > [ 23.283860] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
> > [ 23.286066] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
> > [ 23.288220] softirqs last disabled at (0): [<00000000>] 0x0
> > [ 23.289952] ---[ end trace 0000000000000000 ]---
> > [ 23.291501] ------------[ cut here ]------------
> > [ 23.293027] refcount_t: underflow; use-after-free.
> > [ 23.294644] WARNING: CPU: 0 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xb6/0xfc
> > [ 23.296959] Modules linked in:
> > [ 23.298168] CPU: 0 PID: 61 Comm: kworker/0:1H Tainted: G W 6.3.0-rc1-test-00001-ga98bd42762ed-dirty #972
> > [ 23.301073] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> > [ 23.303642] Workqueue: ttm ttm_bo_delayed_delete
> > [ 23.305190] EIP: refcount_warn_saturate+0xb6/0xfc
> > [ 23.306767] Code: 68 70 e1 0c c8 e8 f6 d6 a9 ff 0f 0b 58 c9 c3 90 80 3d 8a 78 38 c8 00 75 8a c6 05 8a 78 38 c8 01 68 9c e1 0c c8 e8 d6 d6 a9 ff <0f> 0b 59 c9 c3 80 3d 88 78 38 c8 00 0f 85 67 ff ff ff c6 05 88 78
> > [ 23.311935] EAX: 00000026 EBX: c1295950 ECX: c1847e40 EDX: 00000002
> > [ 23.313884] ESI: c12958bc EDI: f7591100 EBP: c1847f18 ESP: c1847f14
> > [ 23.315840] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010246
> > [ 23.317887] CR0: 80050033 CR2: ffbff000 CR3: 0850e000 CR4: 00150ef0
> > [ 23.319859] Call Trace:
> > [ 23.320978] ttm_bo_delayed_delete+0x8c/0x94
> > [ 23.322492] process_one_work+0x21a/0x538
> > [ 23.323959] worker_thread+0x146/0x398
> > [ 23.325353] kthread+0xea/0x10c
> > [ 23.326609] ? process_one_work+0x538/0x538
> > [ 23.328081] ? kthread_complete_and_exit+0x1c/0x1c
> > [ 23.329683] ret_from_fork+0x1c/0x28
> > [ 23.331011] irq event stamp: 33
> > [ 23.332251] hardirqs last enabled at (33): [<c7d28089>] _raw_spin_unlock_irqrestore+0x2d/0x58
> > [ 23.334334] hardirqs last disabled at (32): [<c71d4ba5>] kvfree_call_rcu+0x155/0x2ec
> > [ 23.336176] softirqs last enabled at (0): [<c71217fd>] copy_process+0x989/0x2368
> > [ 23.337904] softirqs last disabled at (0): [<00000000>] 0x0
> > [ 23.339313] ---[ end trace 0000000000000000 ]---
> >
> > -- Steve
>