Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

From: Christian König
Date: Fri Jan 10 2025 - 03:44:46 EST


Am 10.01.25 um 08:37 schrieb Philipp Reisner:
[...]
Could this be due to amdgpu setting sched->ready when the rings are
finished initializing from long ago rather than when the scheduler has
been armed?
Yes and that is absolutely intentional.

Either the driver is not done with it's resume yet, or it has already
started it's suspend handler. So the scheduler backends are not started
and so the ready flag is false.

But some userspace application still tries to submit work.

If we would now wait for this work to finish we would deadlock, so
crashing on the NULL pointer deref is actually the less worse outcome.

Christian.
Hi Christian,

Today in the morning, when I woke up my workstation, I was greeted
with a black screen, on which I still could move my mouse pointer. The
OOPS happens at resume time, not at suspend time:

...
Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, skipping
Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
Jän 10 07:58:14 ryzen9 kernel: BUG: kernel NULL pointer dereference,
address: 0000000000000008
Jän 10 07:58:14 ryzen9 kernel: #PF: supervisor read access in kernel mode
Jän 10 07:58:14 ryzen9 kernel: #PF: error_code(0x0000) - not-present page
Jän 10 07:58:14 ryzen9 kernel: PGD 0 P4D 0
Jän 10 07:58:14 ryzen9 kernel: Oops: Oops: 0000 [#2] PREEMPT SMP NOPTI
Jän 10 07:58:14 ryzen9 kernel: CPU: 2 UID: 1001 PID: 4961 Comm:
chrome:cs0 Tainted: G D OE 6.12.5-200.fc41.x86_64 #1
Jän 10 07:58:14 ryzen9 kernel: Tainted: [D]=DIE, [O]=OOT_MODULE,
[E]=UNSIGNED_MODULE
Jän 10 07:58:14 ryzen9 kernel: Hardware name: Micro-Star International
Co., Ltd. MS-7A38/B450M PRO-VDH MAX (MS-7A38), BIOS B.B0 02/03/2021
Jän 10 07:58:14 ryzen9 kernel: RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
Jän 10 07:58:14 ryzen9 kernel: Code: 90 90 90 90 90 90 90 f3 0f 1e fa
0f 1f 44 00 00 55 53 48 8b 6f 60 48 85 ed 74 3f 48 89 fb 48 89 ef e8
e1 38 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 5c b8
01 00 00 00 f0 48 0f
Jän 10 07:58:14 ryzen9 kernel: RSP: 0018:ffffa52510cf7758 EFLAGS: 00010206
...

Can we conclude that "the driver is not yet ready with it's resume"?

Take a look at those messages right before the crash:

Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, skipping
Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready, skipping

That is basically a 100% certain confirm that an application tries to use the device before before those compute queues are resumed.

Can I have a full dmesg? Maybe the resume is canceled or aborted for some reason.

Or we have a bug in the driver that we forget to resume the compute queues and only do that on demand later on or something like that.

Can you point me to where I could add instrumentation code to dig deeper?

Let me take a look at the full dmesg, something really fishy is going on here.

Thanks,
Christian.


Thanks,
Philipp