Re: Non-deterministically boot into dark screen with `amdgpu`

From: Alexander Monakov
Date: Mon Aug 10 2020 - 02:44:03 EST


Hi,

you should Сс a specialized mailing list and a relevant maintainer,
otherwise your email is likely to be ignored as LKML is an incredibly
high-volume list. Adding amd-gfx and Alex Deucher.

More thoughts below.

On Sun, 9 Aug 2020, Ignat Insarov wrote:

> Hello!
>
> This is an issue report. I am not familiar with the Linux kernel
> development procedure, so please direct me to a more appropriate or
> specialized medium if this is not the right avenue.
>
> My laptop (Ryzen 7 Pro CPU/GPU) boots into dark screen more often than
> not. Screen blackness correlates with a line in the `systemd` journal
> that says `RAM width Nbits DDR4`, where N is either 128 (resulting in
> dark screen) or 64 (resulting in a healthy boot). The number seems to
> be chosen at random with bias towards 128. This has been going on for
> a while so here is some statistics:
>
> * 356 boots proceed far enough to attempt mode setting.
> * 82 boots set RAM width to 64 bits and presumably succeed.
> * 274 boots set RAM width to 128 bits and presumably fail.
>
> The issue is prevented with the `nomodeset` kernel option.
>
> I reported this previously (about a year ago) on the forum of my Linux
> distribution.[1] The issue still persists as of linux 5.8.0.
>
> The details of my graphics controller, as well as some journal
> excerpts, can be seen at [1]. One thing that has changed since then is
> that on failure, there now appears a null pointer dereference error. I
> am attaching the log of kernel messages from the most recent failed
> boot — please request more information if needed.
>
> I appreciate any directions and advice as to how I may go about fixing
> this annoyance.
>
> [1]: https://bbs.archlinux.org/viewtopic.php?id=248273


On the forum you show that in the "success" case there's one less "BIOS
signature incorrect" message. This implies that amdgpu_get_bios() in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
gets the video BIOS from a different source. If that happens every time
(one "signature incorrect" message for "success", two for "failure")
that may be relevant to the problem you're experiencing.

If you don't mind patching and rebuilding the kernel I suggest adding
debug printks to the aforementioned function to see exactly which methods
fail with wrong signature and which succeeds.

Also might be worthwhile to check if there's a BIOS update for your laptop.

Alexander