[...]
Hi Christian,

> > Could this be due to amdgpu setting sched->ready when the rings are
> > finished initializing from long ago rather than when the scheduler has
> > been armed?
>
> Yes, and that is absolutely intentional.
>
> Either the driver is not done with its resume yet, or it has already
> started its suspend handler. So the scheduler backends are not started
> and so the ready flag is false.
>
> But some userspace application still tries to submit work.
>
> If we now waited for this work to finish we would deadlock, so
> crashing on the NULL pointer deref is actually the less bad outcome.
>
> Christian.

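To make the "not ready" case above concrete, here is a minimal userspace sketch of a selection step that skips schedulers whose ready flag is false. It is only a model under stated assumptions: `mock_sched`, its `load` field and `mock_pick_best()` are hypothetical names invented for illustration, not the kernel's structures or functions. The point it shows is that when every candidate is skipped, the caller gets NULL back and has to cope with that.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a GPU scheduler; NOT the kernel struct. */
struct mock_sched {
    const char *name;
    bool ready;         /* false while the backend is not started */
    unsigned int load;  /* number of queued jobs, lower is better */
};

/*
 * Return the least-loaded scheduler that is ready, or NULL if none is.
 * Schedulers that are not ready are skipped with a log line.
 */
static struct mock_sched *mock_pick_best(struct mock_sched **list, size_t n)
{
    struct mock_sched *best = NULL;

    for (size_t i = 0; i < n; i++) {
        struct mock_sched *s = list[i];

        if (!s->ready) {
            fprintf(stderr, "scheduler %s is not ready, skipping\n", s->name);
            continue;
        }
        if (!best || s->load < best->load)
            best = s;
    }
    return best;
}
```

With two schedulers that both have `ready == false`, `mock_pick_best()` returns NULL; any caller that then dereferences the result without a check faults.
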
This morning, when I woke up my workstation, I was greeted with a
black screen on which I could still move my mouse pointer. The
OOPS happens at resume time, not at suspend time:
...
Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, skipping
Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
Jän 10 07:58:14 ryzen9 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Jän 10 07:58:14 ryzen9 kernel: #PF: supervisor read access in kernel mode
Jän 10 07:58:14 ryzen9 kernel: #PF: error_code(0x0000) - not-present page
Jän 10 07:58:14 ryzen9 kernel: PGD 0 P4D 0
Jän 10 07:58:14 ryzen9 kernel: Oops: Oops: 0000 [#2] PREEMPT SMP NOPTI
Jän 10 07:58:14 ryzen9 kernel: CPU: 2 UID: 1001 PID: 4961 Comm: chrome:cs0 Tainted: G D OE 6.12.5-200.fc41.x86_64 #1
Jän 10 07:58:14 ryzen9 kernel: Tainted: [D]=DIE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Jän 10 07:58:14 ryzen9 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A38/B450M PRO-VDH MAX (MS-7A38), BIOS B.B0 02/03/2021
Jän 10 07:58:14 ryzen9 kernel: RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
Jän 10 07:58:14 ryzen9 kernel: Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 53 48 8b 6f 60 48 85 ed 74 3f 48 89 fb 48 89 ef e8 e1 38 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 5c b8 01 00 00 00 f0 48 0f
Jän 10 07:58:14 ryzen9 kernel: RSP: 0018:ffffa52510cf7758 EFLAGS: 00010206
...
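A note on reading the trace above: in the Code: line, the marked instruction `48 8b 50 08` loads from offset 8 of the pointer in %rax, and the instruction just before it (`48 8b 45 10`) loaded that pointer from offset 0x10 of another structure. A fault address of 0x8 therefore means the pointer in %rax was NULL. One plausible reading, which is an assumption and not something the log confirms, is that this is the run queue pointer of the entity and the read is of its scheduler member. The sketch below reproduces only the address arithmetic in userspace; `mock_rq` and its layout are hypothetical stand-ins, not the kernel definition.

```c
#include <stddef.h>
#include <stdint.h>

struct mock_sched;  /* opaque; only the pointer is needed here */

/* Hypothetical layout: 8 bytes of leading data, then the pointer. */
struct mock_rq {
    uint64_t lock_placeholder;  /* stands in for whatever fills bytes 0..7 */
    struct mock_sched *sched;   /* assumed to sit at offset 8 */
};

/*
 * Compute the address the CPU would touch when reading rq->sched,
 * without actually dereferencing rq.
 */
static uintptr_t address_of_sched_read(const struct mock_rq *rq)
{
    return (uintptr_t)rq + offsetof(struct mock_rq, sched);
}
```

Calling `address_of_sched_read(NULL)` yields 8 under this assumed layout, which matches the reported address 0000000000000008.
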
Can we conclude that "the driver is not yet done with its resume"?
Can you point me to where I could add instrumentation code to dig deeper?
Thanks,
Philipp