Re: (REGRESSION bisected) Re: amdgpu errors (VM fault / GPU fault detected) with 5.19 merge window snapshots

From: Christian König
Date: Wed Jun 01 2022 - 10:59:20 EST


On 01.06.22 at 16:55, Alex Deucher wrote:
On Fri, May 27, 2022 at 8:58 AM Michal Kubecek <mkubecek@xxxxxxx> wrote:
On Fri, May 27, 2022 at 11:00:39AM +0200, Michal Kubecek wrote:
Hello,

while testing 5.19 merge window snapshots (commits babf0bb978e3 and
7e284070abe5), I keep getting errors like below. I have not seen them
with 5.18 final or older.

------------------------------------------------------------------------
[ 247.150333] gmc_v8_0_process_interrupt: 46 callbacks suppressed
[ 247.150336] amdgpu 0000:0c:00.0: amdgpu: GPU fault detected: 147 0x00020802 for process firefox pid 6101 thread firefox:cs0 pid 6116
[ 247.150339] amdgpu 0000:0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00107800
[ 247.150340] amdgpu 0000:0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
[ 247.150341] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
[...]
[ 249.925909] amdgpu 0000:0c:00.0: amdgpu: IH ring buffer overflow (0x000844C0, 0x00004A00, 0x000044D0)
[ 250.434986] [drm] Fence fallback timer expired on ring sdma0
[ 466.621568] gmc_v8_0_process_interrupt: 122 callbacks suppressed
[...]
------------------------------------------------------------------------

There does not seem to be any apparent immediate problem with graphics
but when running commit babf0bb978e3, there seemed to be a noticeable
lag in some operations, e.g. when moving a window or repainting a large
part of the terminal window in konsole (no idea if it's related).

My GPU is a Radeon Pro WX 2100 (1002:6995). What other information should
I collect to help debug the issue?

Bisected to commit 5255e146c99a ("drm/amdgpu: rework TLB flushing").
There seem to be later commits depending on it so I did not test
a revert on top of current mainline.

@Christian Koenig, @Yang, Philip: any ideas? I think there were some
fixups for this. Maybe those just haven't hit the tree yet?

I need to double check, but as far as I know we have fixed all the fallout.

It could be that something didn't go upstream because it came too late for the merge window.

Christian.


Alex


I should also mention that most commits tested as "bad" during the
bisect did behave much worse than current mainline (errors starting as
early as with sddm, visibly damaged screen content, sometimes even
crashes). But all of them issued messages similar to those above to the
kernel log.

Michal Kubecek
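
For anyone collecting more data on the faults quoted above, here is a rough Python sketch that pulls the fields out of the "VM fault" lines so that repeated faults can be compared across test kernels. The regular expressions are inferred only from the single log excerpt in this thread, not from the driver source, and the helper names (decode_client, summarize_faults) are made up for illustration. In the excerpt, the reported page number matches VM_CONTEXT1_PROTECTION_FAULT_ADDR (0x00107800 is 1079296) and 0x54433200 is the ASCII string "TC2" with a trailing NUL; the sketch simply reports both so this can be checked on other samples.

```python
import re

# Patterns inferred from the log excerpt in this thread (an assumption, not
# taken from the amdgpu source); other kernels may format these lines differently.
VM_FAULT_RE = re.compile(
    r"VM fault \(0x(?P<status>[0-9a-fA-F]+), vmid (?P<vmid>\d+), pasid (?P<pasid>\d+)\) "
    r"at page (?P<page>\d+), (?P<access>read|write) from "
    r"'(?P<client>[^']*)' \(0x(?P<client_raw>[0-9a-fA-F]+)\)"
)
FAULT_ADDR_RE = re.compile(
    r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x(?P<addr>[0-9a-fA-F]+)"
)


def decode_client(raw_hex):
    """Turn the raw 32-bit client tag into ASCII, dropping trailing NUL bytes."""
    raw = bytes.fromhex(raw_hex.zfill(8))
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


def summarize_faults(log_text):
    """Return one dict per 'VM fault' line, paired with the preceding FAULT_ADDR."""
    faults = []
    last_addr = None
    for line in log_text.splitlines():
        m = FAULT_ADDR_RE.search(line)
        if m:
            last_addr = int(m.group("addr"), 16)
            continue
        m = VM_FAULT_RE.search(line)
        if m:
            faults.append({
                "vmid": int(m.group("vmid")),
                "pasid": int(m.group("pasid")),
                "page": int(m.group("page")),
                "access": m.group("access"),
                "client": decode_client(m.group("client_raw")),
                "fault_addr": last_addr,
            })
            last_addr = None
    return faults


# Three lines copied from the log quoted earlier in this thread.
SAMPLE = """\
[ 247.150339] amdgpu 0000:0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00107800
[ 247.150340] amdgpu 0000:0c:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0D008002
[ 247.150341] amdgpu 0000:0c:00.0: amdgpu: VM fault (0x02, vmid 6, pasid 32780) at page 1079296, write from 'TC2' (0x54433200) (8)
"""

if __name__ == "__main__":
    for fault in summarize_faults(SAMPLE):
        print(fault)
```

Feeding it the output of `dmesg` from a "good" and a "bad" kernel and comparing the vmid, page, and client columns might show whether the faults cluster on one client or address range, though that is only a guess at what would be useful here.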