Re: Non-deterministically boot into dark screen with `amdgpu`
From: Alex Deucher
Date: Mon Aug 10 2020 - 16:35:52 EST
On Mon, Aug 10, 2020 at 7:46 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> Hi guys,
>
> Am 10.08.20 um 08:43 schrieb Alexander Monakov:
>
> Hi,
>
> you should Cc a specialized mailing list and a relevant maintainer,
> otherwise your email is likely to be ignored as LKML is an incredibly
> high-volume list. Adding amd-gfx and Alex Deucher.
>
>
> Thanks for forwarding this. AFAIK we haven't heard of this bug before, but Alex might already know more about it.
>
> More thoughts below.
>
> On Sun, 9 Aug 2020, Ignat Insarov wrote:
>
> Hello!
>
> This is an issue report. I am not familiar with the Linux kernel
> development procedure, so please direct me to a more appropriate or
> specialized medium if this is not the right avenue.
>
> My laptop (Ryzen 7 Pro CPU/GPU) boots into dark screen more often than
> not. Screen blackness correlates with a line in the `systemd` journal
> that says `RAM width Nbits DDR4`, where N is either 128 (resulting in
> dark screen) or 64 (resulting in a healthy boot). The number seems to
> be chosen at random with bias towards 128. This has been going on for
> a while, so here are some statistics:
>
> * 356 boots proceed far enough to attempt mode setting.
> * 82 boots set RAM width to 64 bits and presumably succeed.
> * 274 boots set RAM width to 128 bits and presumably fail.
>
> The issue is prevented with the `nomodeset` kernel option.
>
> I reported this previously (about a year ago) on the forum of my Linux
> distribution.[1] The issue persists as of Linux 5.8.0.
>
> The details of my graphics controller, as well as some journal
> excerpts, can be seen at [1]. One thing that has changed since then is
> that a null pointer dereference error now appears on failure. I am
> attaching the log of kernel messages from the most recent failed
> boot; please request more information if needed.
>
> I appreciate any directions and advice as to how I may go about fixing
> this annoyance.
>
> [1]: https://bbs.archlinux.org/viewtopic.php?id=248273
>
> On the forum you show that in the "success" case there's one fewer "BIOS
> signature incorrect" message. This implies that amdgpu_get_bios() in
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> gets the video BIOS from a different source. If that happens every time
> (one "signature incorrect" message for "success", two for "failure")
> that may be relevant to the problem you're experiencing.
>
> If you don't mind patching and rebuilding the kernel, I suggest adding
> debug printks to the aforementioned function to see exactly which methods
> fail with a wrong signature and which one succeeds.
>
> It might also be worthwhile to check whether there's a BIOS update for your laptop.
>
>
> It might also be a good idea to try the latest amd-staging-drm-next branch from Alex's repository (bear with me, I don't have the link at hand, but it should be easy to find).
>
> Opening a bug report or searching the existing ones for something similar under https://gitlab.freedesktop.org/drm/amd/-/issues might be a good idea as well.
>
> And I completely agree that this sounds like an issue getting the BIOS image.
I've not heard of an issue like this either. Best to file a GitLab
bug and attach your full dmesg output in both the working and
non-working cases and we can go from there.
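
When comparing the two logs, something like the following Python sketch
might help. It is only an illustration, not part of any amdgpu tooling:
it counts the "BIOS signature incorrect" lines and pulls out the number
from the "RAM width Nbits" line, both of which are quoted earlier in this
thread. The function name `summarize_boot_log` is made up for this
example, and the exact layout of real log lines may differ from what the
pattern assumes.

```python
import re

# Both strings are taken from messages quoted in this thread; the rest of
# each real log line (timestamps, prefixes, trailing values) is not assumed.
SIGNATURE_MSG = "BIOS signature incorrect"
RAM_WIDTH_RE = re.compile(r"RAM width (\d+)bits")


def summarize_boot_log(text):
    """Return (signature warning count, reported RAM width or None)."""
    # Count every log line that mentions the signature warning.
    signature_warnings = sum(SIGNATURE_MSG in line for line in text.splitlines())
    # Take the first RAM width the driver reports, if there is one.
    match = RAM_WIDTH_RE.search(text)
    ram_width = int(match.group(1)) if match else None
    return signature_warnings, ram_width
```

Feeding it the saved dmesg text from a working boot and from a
non-working boot should show whether the "one warning with 64 bits, two
warnings with 128 bits" pattern holds on every boot.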
Alex
>
> Thanks,
> Christian.
>
>
> Alexander
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>