Re: [amdgpu] deadlock

From: Alex Deucher
Date: Wed Feb 03 2021 - 08:57:28 EST

Next message: Jason Gunthorpe: "Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd"
Previous message: David Hildenbrand: "Re: [PATCH 2/2] KVM: Scalable memslots implementation"
In reply to: Christian König: "Re: [amdgpu] deadlock"
Next in thread: Daniel Vetter: "Re: [amdgpu] deadlock"

On Wed, Feb 3, 2021 at 7:30 AM Christian König <christian.koenig@xxxxxxx> wrote:
>
> On 03.02.21 at 13:24, Daniel Vetter wrote:
> > On Wed, Feb 03, 2021 at 01:21:20PM +0100, Christian König wrote:
> >> On 03.02.21 at 12:45, Daniel Gomez wrote:
> >>> On Wed, 3 Feb 2021 at 10:47, Daniel Gomez <daniel@xxxxxxxx> wrote:
> >>>> On Wed, 3 Feb 2021 at 10:17, Daniel Vetter <daniel@xxxxxxxx> wrote:
> >>>>> On Wed, Feb 3, 2021 at 9:51 AM Christian König <christian.koenig@xxxxxxx> wrote:
> >>>>>> On 03.02.21 at 09:48, Daniel Vetter wrote:
> >>>>>>> On Wed, Feb 3, 2021 at 9:36 AM Christian König <christian.koenig@xxxxxxx> wrote:
> >>>>>>>> Hi Daniel,
> >>>>>>>>
> >>>>>>>> this is not a deadlock, but rather a hardware lockup.
> >>>>>>> Are you sure? IME, getting stuck in dma_fence_wait generally has a good
> >>>>>>> chance of being a dma_fence deadlock. A GPU hang should never result in
> >>>>>>> a forever-stuck dma_fence.
> >>>>>> Yes, I'm pretty sure. Otherwise the hardware clocks wouldn't go up like
> >>>>>> this.
> >>>>> To clarify: it could be both. TDR should notice and get us out of
> >>>>> this, but if there's a dma_fence deadlock and we can't re-emit or
> >>>>> force-complete the pending work, then we're stuck for good.
> >>>>> -Daniel
> >>>>>
> >>>>>> The question is rather why we end up in the userptr handling for GFX.
> >>>>>> Our ROCm OpenCL stack shouldn't use this.
> >>>>>>
> >>>>>>> Daniel, can you please re-hang your machine, dump the backtraces of
> >>>>>>> all tasks into dmesg with sysrq-t, and attach the result? Without all
> >>>>>>> the backtraces it's tricky to construct the full dependency chain of
> >>>>>>> what's going on. Also, is this plain -rc6, without any more patches
> >>>>>>> on top?
> >>>>>> Yeah, that's still a good idea to have.
> >>>> Here are the full dmesg backtrace logs after the hang:
> >>>> https://pastebin.com/raw/kzivm2L3
> >>>>
> >>>> This is another dmesg log, with the backtraces taken after sending SIGKILL
> >>>> to the matrix process (I didn't have sysrq enabled at the time):
> >>>> https://pastebin.com/raw/pRBwGcj1
> >>> I've now removed all our v4l2 patches and run the same test with the 'plain'
> >>> mainline version (-rc6).
> >>>
> >>> Reference: 3aaf0a27ffc29b19a62314edd684b9bc6346f9a8
> >>>
> >>> Same error, same behaviour. Full dmesg log here:
> >>> https://pastebin.com/raw/KgaEf7Y1
> >>> Note:
> >>> the sysrq-t dump taken before running the test starts at
> >>> "[ 122.016502] sysrq: Show State";
> >>> the sysrq-t dump taken after the test starts at
> >>> "[ 495.587671] sysrq: Show State".
> >> There is nothing amdgpu related in there except for waiting for the
> >> hardware.
> > Yeah, but there's also no other driver that could cause a stuck dma_fence,
> > so why is reset not cleaning up the mess here? Irrespective of why the gpu
> > is stuck, the kernel should at least complete all the dma_fences even if
> > the gpu for some reason is terminally ill ...
>
> That's a good question as well. I'm digging into this.
>
> My best theory is that the amdgpu packages disabled GPU reset for some
> reason.

The timeout for compute queues is infinite, to allow for long-running
compute kernels, so a hung compute job never triggers a reset on its own.
You can override this with the amdgpu.lockup_timeout module parameter.
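
As a rough sketch of such an override (an illustration only, not a
recommended setting): this assumes the parameter accepts a comma-separated
list of timeouts in milliseconds in the order GFX, Compute, SDMA, Video,
which is worth checking against "modinfo amdgpu" for the kernel in use. The
file name and the numeric values below are hypothetical.

```conf
# /etc/modprobe.d/amdgpu-timeout.conf  (hypothetical file name)
# Assumed order: GFX,Compute,SDMA,Video; values in ms, chosen only for illustration.
options amdgpu lockup_timeout=10000,60000,10000,10000

# Equivalent form on the kernel command line:
#   amdgpu.lockup_timeout=10000,60000,10000,10000
```

After a reboot or module reload, the value in effect should be readable
from /sys/module/amdgpu/parameters/lockup_timeout.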

Alex

>
> But the much more interesting question is why we end up in this call
> path. I've pinged internally, but east coast is not awake yet :)
>
> Christian.
>
> > -Daniel
> >
> >> This is a pretty standard hardware lockup, but I'm still waiting for an
> >> explanation why we end up in this call path in the first place.
> >>
> >> Christian.
> >>
> >>>
> >>>>>> Christian.
> >>>>>>
> >>>>>>> -Daniel
> >>>>>>>
> >>>>>>>> Which OpenCL stack are you using?
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>>> On 03.02.21 at 09:33, Daniel Gomez wrote:
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> I have a deadlock with the amdgpu mainline driver when running two OpenCL
> >>>>>>>>> applications in parallel. So far, we've been able to replicate it easily by
> >>>>>>>>> executing clinfo and MatrixMultiplication (from the AMD opencl-samples). The
> >>>>>>>>> opencl-samples are quite old, so if you have any other suggestion for
> >>>>>>>>> testing, I'd be very happy to try it as well.
> >>>>>>>>>
> >>>>>>>>> How to replicate the issue:
> >>>>>>>>>
> >>>>>>>>> # while true; do /usr/bin/MatrixMultiplication --device gpu \
> >>>>>>>>> --deviceId 0 -x 1000 -y 1000 -z 1000 -q -t -i 50; done
> >>>>>>>>> # while true; do clinfo; done
> >>>>>>>>>
> >>>>>>>>> Output:
> >>>>>>>>>
> >>>>>>>>> After a minute or less (sometimes longer), MatrixMultiplication and
> >>>>>>>>> clinfo hang. In addition, radeontop shows the Graphics pipe going from
> >>>>>>>>> ~50% to 100% and the shader clock going from ~35% to ~96%.
> >>>>>>>>>
> >>>>>>>>> clinfo keeps printing:
> >>>>>>>>> ioctl(7, DRM_IOCTL_SYNCOBJ_WAIT, 0x7ffe46e5f950) = -1 ETIME (Timer expired)
> >>>>>>>>>
> >>>>>>>>> And strace on MatrixMultiplication shows the following if you try to
> >>>>>>>>> kill the process:
> >>>>>>>>>
> >>>>>>>>> sched_yield() = 0
> >>>>>>>>> futex(0x557e945343b8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANYstrace: Process 651 detached
> >>>>>>>>> <detached ...>
> >>>>>>>>>
> >>>>>>>>> After this, the GPU is not functional at all and a power cycle is needed
> >>>>>>>>> to restore the system.
> >>>>>>>>>
> >>>>>>>>> Hardware info:
> >>>>>>>>> CPU: AMD Ryzen Embedded V1605B with Radeon Vega Gfx (8) @ 2.000GHz
> >>>>>>>>> GPU: AMD ATI Radeon Vega Series / Radeon Vega Mobile Series
> >>>>>>>>>
> >>>>>>>>> 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev 83)
> >>>>>>>>> DeviceName: Broadcom 5762
> >>>>>>>>> Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series]
> >>>>>>>>> Kernel driver in use: amdgpu
> >>>>>>>>> Kernel modules: amdgpu
> >>>>>>>>>
> >>>>>>>>> Linux kernel info:
> >>>>>>>>>
> >>>>>>>>> root@qt5222:~# uname -a
> >>>>>>>>> Linux qt5222 5.11.0-rc6-qtec-standard #2 SMP Tue Feb 2 09:41:46 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
> >>>>>>>>>
> >>>>>>>>> By enabling the kernel lock stats I could see that MatrixMultiplication
> >>>>>>>>> is hung in the amdgpu_mn_invalidate_gfx function:
> >>>>>>>>>
> >>>>>>>>> [ 738.359202] 1 lock held by MatrixMultiplic/653:
> >>>>>>>>> [ 738.359206] #0: ffff88810e364fe0 (&adev->notifier_lock){+.+.}-{3:3}, at: amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >>>>>>>>>
> >>>>>>>>> In the amdgpu_mn_invalidate_gfx function, I can see that
> >>>>>>>>> dma_resv_wait_timeout_rcu is called with wait_all (fences) and
> >>>>>>>>> MAX_SCHEDULE_TIMEOUT, so I guess the code gets stuck there waiting forever.
> >>>>>>>>> According to the documentation: "When somebody tries to invalidate the page
> >>>>>>>>> tables we block the update until all operations on the pages in question are
> >>>>>>>>> completed, then those pages are marked as accessed and also dirty if it
> >>>>>>>>> wasn’t a read only access." It looks like the fences are deadlocked and
> >>>>>>>>> therefore the wait never returns. Is that possible? Any hint as to where I
> >>>>>>>>> can look to fix this?
> >>>>>>>>>
> >>>>>>>>> Thank you in advance.
> >>>>>>>>>
> >>>>>>>>> Here is the full dmesg output:
> >>>>>>>>>
> >>>>>>>>> [ 738.337726] INFO: task MatrixMultiplic:653 blocked for more than 122 seconds.
> >>>>>>>>> [ 738.344937] Not tainted 5.11.0-rc6-qtec-standard #2
> >>>>>>>>> [ 738.350384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>>>>>>>> [ 738.358240] task:MatrixMultiplic state:D stack: 0 pid: 653 ppid: 1 flags:0x00004000
> >>>>>>>>> [ 738.358254] Call Trace:
> >>>>>>>>> [ 738.358261] ? dma_fence_default_wait+0x1eb/0x230
> >>>>>>>>> [ 738.358276] __schedule+0x370/0x960
> >>>>>>>>> [ 738.358291] ? dma_fence_default_wait+0x117/0x230
> >>>>>>>>> [ 738.358297] ? dma_fence_default_wait+0x1eb/0x230
> >>>>>>>>> [ 738.358305] schedule+0x51/0xc0
> >>>>>>>>> [ 738.358312] schedule_timeout+0x275/0x380
> >>>>>>>>> [ 738.358324] ? dma_fence_default_wait+0x1eb/0x230
> >>>>>>>>> [ 738.358332] ? mark_held_locks+0x4f/0x70
> >>>>>>>>> [ 738.358341] ? dma_fence_default_wait+0x117/0x230
> >>>>>>>>> [ 738.358347] ? lockdep_hardirqs_on_prepare+0xd4/0x180
> >>>>>>>>> [ 738.358353] ? _raw_spin_unlock_irqrestore+0x39/0x40
> >>>>>>>>> [ 738.358362] ? dma_fence_default_wait+0x117/0x230
> >>>>>>>>> [ 738.358370] ? dma_fence_default_wait+0x1eb/0x230
> >>>>>>>>> [ 738.358375] dma_fence_default_wait+0x214/0x230
> >>>>>>>>> [ 738.358384] ? dma_fence_release+0x1a0/0x1a0
> >>>>>>>>> [ 738.358396] dma_fence_wait_timeout+0x105/0x200
> >>>>>>>>> [ 738.358405] dma_resv_wait_timeout_rcu+0x1aa/0x5e0
> >>>>>>>>> [ 738.358421] amdgpu_mn_invalidate_gfx+0x55/0xa0 [amdgpu]
> >>>>>>>>> [ 738.358688] __mmu_notifier_release+0x1bb/0x210
> >>>>>>>>> [ 738.358710] exit_mmap+0x2f/0x1e0
> >>>>>>>>> [ 738.358723] ? find_held_lock+0x34/0xa0
> >>>>>>>>> [ 738.358746] mmput+0x39/0xe0
> >>>>>>>>> [ 738.358756] do_exit+0x5c3/0xc00
> >>>>>>>>> [ 738.358763] ? find_held_lock+0x34/0xa0
> >>>>>>>>> [ 738.358780] do_group_exit+0x47/0xb0
> >>>>>>>>> [ 738.358791] get_signal+0x15b/0xc50
> >>>>>>>>> [ 738.358807] arch_do_signal_or_restart+0xaf/0x710
> >>>>>>>>> [ 738.358816] ? lockdep_hardirqs_on_prepare+0xd4/0x180
> >>>>>>>>> [ 738.358822] ? _raw_spin_unlock_irqrestore+0x39/0x40
> >>>>>>>>> [ 738.358831] ? ktime_get_mono_fast_ns+0x50/0xa0
> >>>>>>>>> [ 738.358844] ? amdgpu_drm_ioctl+0x6b/0x80 [amdgpu]
> >>>>>>>>> [ 738.359044] exit_to_user_mode_prepare+0xf2/0x1b0
> >>>>>>>>> [ 738.359054] syscall_exit_to_user_mode+0x19/0x60
> >>>>>>>>> [ 738.359062] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>>>>>>> [ 738.359069] RIP: 0033:0x7f6b89a51887
> >>>>>>>>> [ 738.359076] RSP: 002b:00007f6b82b54b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> >>>>>>>>> [ 738.359086] RAX: fffffffffffffe00 RBX: 00007f6b82b54b50 RCX: 00007f6b89a51887
> >>>>>>>>> [ 738.359091] RDX: 00007f6b82b54b50 RSI: 00000000c02064c3 RDI: 0000000000000007
> >>>>>>>>> [ 738.359096] RBP: 00000000c02064c3 R08: 0000000000000003 R09: 00007f6b82b54bbc
> >>>>>>>>> [ 738.359101] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000165a0bc00
> >>>>>>>>> [ 738.359106] R13: 0000000000000007 R14: 0000000000000001 R15: 0000000000000000
> >>>>>>>>> [ 738.359129]
> >>>>>>>>> Showing all locks held in the system:
> >>>>>>>>> [ 738.359141] 1 lock held by khungtaskd/54:
> >>>>>>>>> [ 738.359148] #0: ffffffff829f6840 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x183
> >>>>>>>>> [ 738.359187] 1 lock held by systemd-journal/174:
> >>>>>>>>> [ 738.359202] 1 lock held by MatrixMultiplic/653:
> >>>>>>>>> [ 738.359206] #0: ffff88810e364fe0 (&adev->notifier_lock){+.+.}-{3:3}, at: amdgpu_mn_invalidate_gfx+0x34/0xa0 [amdgpu]
> >>>>>>>>>
> >>>>>>>>> Daniel
> >>>>>>>> _______________________________________________
> >>>>>>>> dri-devel mailing list
> >>>>>>>> dri-devel@xxxxxxxxxxxxxxxxxxxxx
> >>>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>>>> --
> >>>>> Daniel Vetter
> >>>>> Software Engineer, Intel Corporation
> >>>>> http://blog.ffwll.ch/
> >>> _______________________________________________
> >>> amd-gfx mailing list
> >>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
