Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training

From: Mikhail Gavrilov

Date: Thu Jun 18 2026 - 02:45:16 EST

On Thu, Jun 18, 2026 at 11:17 AM <sashiko-bot@xxxxxxxxxx> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>
> Pre-existing issues:
> - [High] Removing the `fs_reclaim` lockdep annotation silences a true-positive warning about a real ABBA deadlock between memory reclaim and GPU reset, rather than fixing the underlying bug.
> --
>
> commit 4896f663f62249014930226b2bc964c2419486d0
> Author: Mikhail Gavrilov <mikhail.v.gavrilov@xxxxxxxxx>
>
> drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
>
> The commit removes the fs_reclaim_acquire() and fs_reclaim_release() calls
> from amdgpu_lockdep_init() to silence a circular locking dependency warning,
> claiming it is a false positive.
>
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > index d5d71fd7c70d0..eeb3b5007f80f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
> > /* Level 6: Reset control lock */
> > mutex_lock(&reset_ctl.reset_lock);
> >
> > - /*
> > - * Mark potential memory reclaim boundary.
> > - * GPU operations might trigger memory allocation/reclaim.
> > - */
> > - fs_reclaim_acquire(GFP_KERNEL);
> > -
>
> [Severity: High]
> This is a pre-existing issue, but does removing this lockdep annotation
> silence a true-positive warning about a real ABBA deadlock?
>
> While the circular dependency might not happen in the same thread, couldn't a
> cross-thread deadlock occur between memory reclaim and GPU reset?
>
> For example, if Thread A executes the MMU notifier amdgpu_hmm_invalidate_gfx()
> (e.g., during direct reclaim), it acquires adev->notifier_lock and blocks in
> dma_resv_wait_timeout() waiting for a hardware fence:
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c:amdgpu_hmm_invalidate_gfx() {
> ...
> mutex_lock(&adev->notifier_lock);
> ...
> r = dma_resv_wait_timeout(vm_root->tbo.base.resv,
> DMA_RESV_USAGE_BOOKKEEP, false,
> MAX_SCHEDULE_TIMEOUT);
> }
>
> If Thread B (the GPU reset thread) takes down_write(&adev->reset_domain->sem)
> to reset the GPU and signal the fences, and performs GFP_KERNEL allocations,
> it could be forced into direct memory reclaim (fs_reclaim). This would invoke
> the MMU notifier, attempting to acquire adev->notifier_lock, forming a cycle.
>
> Would it be better to fix the underlying deadlock by wrapping the GPU reset
> path with memalloc_noreclaim_save() or memalloc_noio_save() to prevent it
> from entering memory reclaim, instead of removing the lockdep annotation?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260618055216.56191-1-mikhail.v.gavrilov@xxxxxxxxx?part=1

The reported splat is not the reset/reclaim deadlock described here.

The splat is single-threaded: kswapd holds fs_reclaim and the mmu_notifier
invalidate range, then takes notifier_lock in amdgpu_hmm_invalidate_gfx().
That is the normal, mandatory order
fs_reclaim -> mmu_notifier -> notifier_lock. The MMU notifier callback runs
from inside reclaim, so notifier_lock nests inside fs_reclaim and is never
held across an allocation that can enter reclaim.

amdgpu_lockdep_init() asserts the opposite edge, notifier_lock -> fs_reclaim,
by calling fs_reclaim_acquire() while notifier_lock is held. That edge does
not occur at runtime, so the reported cycle is a false positive. Dropping the
annotation removes the impossible edge and touches no real lock.
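
To make the argument concrete, here is a toy userspace model of the
dependency graph. It is only a sketch, not the kernel's lockdep
implementation, and every name in it (dep_graph, record_runtime_edges(),
record_training_edge(), the enum values) is made up for illustration. It
records the two runtime edges described above, then the extra edge asserted
by the training code, and checks for a cycle in each case.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of a lock-class dependency graph; NOT the kernel's lockdep. */
enum lock_class {
	FS_RECLAIM,
	MMU_NOTIFIER_RANGE,
	NOTIFIER_LOCK,
	NR_CLASSES
};

struct dep_graph {
	/* edge[a][b] is true when b was acquired while a was held */
	bool edge[NR_CLASSES][NR_CLASSES];
};

static void dep_graph_init(struct dep_graph *g)
{
	memset(g, 0, sizeof(*g));
}

static void dep_graph_add(struct dep_graph *g, int held, int next)
{
	assert(held >= 0 && held < NR_CLASSES);
	assert(next >= 0 && next < NR_CLASSES);
	g->edge[held][next] = true;
}

/* Depth-first search: is 'target' reachable from 'from'? */
static bool reachable(const struct dep_graph *g, int from, int target,
		      bool *seen)
{
	if (from == target)
		return true;
	seen[from] = true;
	for (int n = 0; n < NR_CLASSES; n++)
		if (g->edge[from][n] && !seen[n] &&
		    reachable(g, n, target, seen))
			return true;
	return false;
}

/* A cycle exists if some edge a -> b also has a path b -> ... -> a. */
static bool dep_graph_has_cycle(const struct dep_graph *g)
{
	for (int a = 0; a < NR_CLASSES; a++)
		for (int b = 0; b < NR_CLASSES; b++) {
			bool seen[NR_CLASSES] = { false };

			if (g->edge[a][b] && reachable(g, b, a, seen))
				return true;
		}
	return false;
}

/* Edges seen on the runtime reclaim path described in this mail. */
static void record_runtime_edges(struct dep_graph *g)
{
	dep_graph_add(g, FS_RECLAIM, MMU_NOTIFIER_RANGE);
	dep_graph_add(g, MMU_NOTIFIER_RANGE, NOTIFIER_LOCK);
}

/* Edge asserted by taking fs_reclaim while notifier_lock is held. */
static void record_training_edge(struct dep_graph *g)
{
	dep_graph_add(g, NOTIFIER_LOCK, FS_RECLAIM);
}
```

With only record_runtime_edges() applied, dep_graph_has_cycle() returns
false; calling record_training_edge() as well closes the loop and it returns
true. That is the shape of the reported splat.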

On the cross-thread reset case: if the reset path really holds
reset_domain->sem across a GFP_KERNEL allocation, lockdep learns
reset_sem -> fs_reclaim from that real allocation, not from this annotation.
The fs_reclaim_acquire() here adds nothing for that real edge; it only injects
the impossible notifier_lock -> fs_reclaim one. Worse, because the false
positive fires on the innocent kswapd path, lockdep calls debug_locks_off()
and stays disabled for the rest of the boot, so it could no longer report
exactly that reset deadlock.

Note that memalloc_noreclaim_save() on the reset path would not silence this
splat: the false edge lives in amdgpu_lockdep_init(), independent of the reset
path. The splat reproduces with a userptr BO plus MADV_PAGEOUT and is gone
with this patch applied; I verified both.
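
For readers who want to try something similar, the CPU side of such a test
can be sketched as below. This is not the actual reproducer, which is not
shown in this thread. The helper names are hypothetical, and the step that
registers the range as a userptr BO is deliberately left out because it needs
an amdgpu device (libdrm's amdgpu_create_bo_from_user_mem() would be one way
to do it, but that is an assumption). Anonymous pages can only be paged out
if swap is available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* Linux UAPI value; needs Linux 5.4 or later */
#endif

/* Map 'len' bytes of anonymous memory and touch them so they are resident. */
static void *map_touched_region(size_t len)
{
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	memset(p, 0xA5, len);
	return p;
}

/*
 * Ask the kernel to reclaim the range right away.
 * Returns 0 on success or a negative errno value.
 * A real test would register [p, p + len) as a userptr BO before this call
 * so that reclaim has to go through the driver's MMU notifier.
 */
static int pageout_region(void *p, size_t len)
{
	if (madvise(p, len, MADV_PAGEOUT) != 0)
		return -errno;
	return 0;
}
```

A caller would run map_touched_region(), create the userptr BO over the
returned range, call pageout_region(), and then check dmesg for the lockdep
report.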

If the reset path is confirmed to make reclaim-capable allocations under
reset_domain->sem, that is a real but separate issue, and
memalloc_noreclaim_save() there would be reasonable. It belongs in a
different patch and does not change this one.

--
Best Regards,
Mike Gavrilov.