Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

From: Christian König
Date: Mon Jan 13 2025 - 03:32:29 EST


On 10.01.25 at 16:10, Alex Deucher wrote:
On Fri, Jan 10, 2025 at 9:48 AM Christian König
<christian.koenig@xxxxxxx> wrote:
On 10.01.25 at 15:32, Philipp Reisner wrote:
[...]
Take a look at those messages right before the crash:

Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready,
skipping
Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready,
skipping

That is basically 100% confirmation that an application tries to
use the device before those compute queues are resumed.

Can I have a full dmesg? Maybe the resume is canceled or aborted for
some reason.

Yes, of course. I have made the files available here:
https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa

Ah! That suddenly makes much more sense.

Here is the root cause:

[111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
[111314.135761] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
[111314.373786] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
[111314.611722] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
[111314.849647] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)
[111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
[111315.207293] [drm] UVD and UVD ENC initialized successfully.
[111315.308270] [drm] VCE initialized successfully.
[111315.447494] PM: resume devices took 2.306 seconds
[111315.447865] OOM killer enabled.

I'm surprised that this works at all. For some reason the graphics queue
works, but the compute queues fail to resume.
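
For readers of the log above: the (-110) is a negative errno value, and on x86-64 Linux 110 is ETIMEDOUT, so each of those ring tests timed out rather than returning a wrong result. A minimal sketch of that check follows; the helper name ring_test_timed_out is hypothetical and not part of amdgpu.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not an amdgpu function: reports whether a ring
 * test return value is a timeout (negative errno convention).
 */
bool ring_test_timed_out(int ret)
{
    return ret == -ETIMEDOUT;
}
```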

@Alex what do we do about that? We could return an error when not all
rings come up again after resume, but that will probably result in a
number of complaints.

Maybe return an error if all of the rings of a particular type fail,
but if only some do, we should be able to deal with that. We
currently set up 8 compute rings. We probably don't need that many.
Maybe just two (high and low priority).
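
The policy suggested above (fail resume only when every ring of one type fails) could be sketched roughly as follows. This is a standalone illustration under stated assumptions, not driver code: the ring_type enum, struct ring_result, and resume_should_fail are hypothetical names and do not correspond to amdgpu's own data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ring classes; not the amdgpu driver's own types. */
enum ring_type { RING_GFX, RING_COMPUTE, RING_TYPE_COUNT };

/* Hypothetical per-ring test outcome. */
struct ring_result {
    enum ring_type type;
    int test_ret;   /* 0 on success, negative errno on failure */
};

/*
 * Returns true when resume should be reported as failed: some ring type
 * has at least one ring, and every ring of that type failed its test.
 * A partial failure within a type is tolerated.
 */
bool resume_should_fail(const struct ring_result *rings, size_t n)
{
    size_t total[RING_TYPE_COUNT] = {0};
    size_t failed[RING_TYPE_COUNT] = {0};

    /* Count rings and failed rings per type. */
    for (size_t i = 0; i < n; i++) {
        total[rings[i].type]++;
        if (rings[i].test_ret != 0)
            failed[rings[i].type]++;
    }

    /* Fail only if a populated type has no working ring left. */
    for (int t = 0; t < RING_TYPE_COUNT; t++) {
        if (total[t] > 0 && failed[t] == total[t])
            return true;
    }
    return false;
}
```

Under this rule, a resume where some but not all compute rings fail their test would still succeed, and the remaining question is how the rest of the stack avoids the rings that did not come back.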

Reducing the number of queues would make the problem even more severe instead of helping, since you would then have an even smaller chance of successfully resuming.

Currently we don't abort resume when the compute queues don't resume, but this leads to a crash later on.

The issue is that if we start aborting resume, the end-user experience doesn't really improve; we just avoid the crash.

Either we need to tell Mesa to stop using the compute queues by default (what is that good for anyway?) or we need to get the compute queues working reliably after a resume.

Christian.


Alex

Regards,
Christian.


best regards,
Philipp