Re: [git pull] drm for 6.8

From: Mario Limonciello
Date: Wed Jan 24 2024 - 13:24:05 EST


On 1/24/2024 11:52, Mario Limonciello wrote:
On 1/24/2024 11:51, Thorsten Leemhuis wrote:
Linus, if you have a minute, I'd really like to know...

On 24.01.24 17:41, Mario Limonciello wrote:
On 1/24/2024 10:24, Vlastimil Babka wrote:
On 1/24/24 16:31, Donald Carr wrote:
On Wed, Jan 24, 2024 at 7:06 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
When testing the rc1 on my openSUSE Tumbleweed desktop, I've started
experiencing "frozen desktop" (KDE/Wayland) issues. The symptoms are that
everything freezes including mouse cursor. After a while it either resolves,
or e.g. firefox crashes (if it was actively used when it froze) or it's
frozen for too long and I reboot with alt-sysrq-b. When it's frozen I can
still ssh to the machine, and there's nothing happening in dmesg.
The machine is based on AMD Ryzen 7 2700 and Radeon RX7600.
[...]
I am experiencing the exact same symptoms;

Big thanks to Thorsten who suggested I look at the following:

https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@xxxxxxx/
https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@xxxxxxxxxxxxxx/

Instead of further bisection I've applied Mario's revert from the first link
on top of 6.8-rc1 and the issue seems gone for me now.

Thanks for confirming.  I don't think we should jump straight to the revert
right now.

I posted it in case that is the direction we need to go
(a simple git revert didn't work due to contextual changes).

Let's give the folks who work on GPU scheduler some time to understand
the failure and see if they can fix it.

...how you think about this and other situations like this. Given that we have

* two affected people in this thread
* one earlier thread about it
* the machine that made Mario write the patch
* and I have someone in #fedora-kernel that likely is affected as well

it seems that this is not some corner case very few people run into.
Hence I tend to say that this should be dealt with rather sooner than
later. Maybe before rc2? Or is this asking too much?

The thing from my point of view is that each such problem might
discourage testers from testing again or lead to thoughts like "I only
start testing after -rc4". Not to mention that other people will try to
bisect the problem like Vlastimil did, which will cost them quite some
time and effort -- only to find out that we knew about the problem
already and did not quickly fix it. That is discouraging for them as
well and thus bad for field testing I'd assume.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

A test patch was just posted.  I haven't gotten a chance to try it yet. I will this afternoon.

The test patch [1] posted to [2] works for me. I expect that Matthew will post it to dri-devel and this can catch RC2 or RC3.

[1] https://gitlab.freedesktop.org/drm/amd/uploads/ca8dfaa22d6f5d247c28acf6cf3eafd2/0001-Drain-all-entities-in-DRM-run-jon-worker.patch
[2] https://gitlab.freedesktop.org/drm/amd/-/issues/3124