[PATCH v1 0/2] drm/ttm: fix bulk_move cursor use-after-free for unevictable resources

From: Samuel Ainsworth

Date: Mon Jun 15 2026 - 19:50:02 EST


A resource can become unevictable (pinned or swapped) after it has been
added to a bo's bulk_move LRU cursor. ttm_resource_del_bulk_move() then
skips removing it -- both on free and on ttm_bo_set_bulk_move() during bo
teardown -- because the resource is unevictable, leaving the cursor's
pos->first/pos->last pointing at it. Once the resource is freed, the next
allocation on that bulk_move dereferences the dangling cursor
(use-after-free) and corrupts the LRU list, which CONFIG_DEBUG_LIST turns
into a fatal BUG.

In the field this is a hibernation-triggered panic on a Framework 13 (AMD
Ryzen 7040): a buffer swapped out during hibernate is closed after resume
(amdgpu_gem_object_close -> amdgpu_vm_bo_del -> ttm_bo_set_bulk_move()),
which leaves its unevictable resource on the VM's bulk_move cursor; a later
GEM allocation on that cursor then faults (drm/amd issue #5387).

Patch 1 tracks cursor membership explicitly so the del always undoes the
add, regardless of any pin/swap transition. Patch 2 adds kunit regression
coverage.
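
For reviewers who want the idea in isolation, here is a minimal userspace
sketch of the two removal policies. It is a toy model, not the patch: the
struct layouts, the on_cursor flag and every function name below are
hypothetical stand-ins chosen for illustration, and the real TTM code
differs in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a TTM resource; NOT the real struct ttm_resource. */
struct res {
	struct res *prev, *next;
	bool evictable;		/* false once "pinned" or "swapped" */
	bool on_cursor;		/* hypothetical membership flag */
};

/* Toy stand-in for one bulk_move position (first/last of a range). */
struct cursor_pos {
	struct res *first, *last;
};

/* Append an evictable resource to the range and record membership. */
static void cursor_add(struct cursor_pos *pos, struct res *r)
{
	if (!r->evictable)
		return;
	r->prev = pos->last;
	r->next = NULL;
	if (pos->last)
		pos->last->next = r;
	else
		pos->first = r;
	pos->last = r;
	r->on_cursor = true;
}

/* Unlink a member and fix up first/last. */
static void cursor_unlink(struct cursor_pos *pos, struct res *r)
{
	assert(r->on_cursor);	/* only members may be unlinked */
	if (r->prev)
		r->prev->next = r->next;
	else
		pos->first = r->next;
	if (r->next)
		r->next->prev = r->prev;
	else
		pos->last = r->prev;
	r->prev = r->next = NULL;
	r->on_cursor = false;
}

/* Policy described as the bug: decide by current evictability. */
static void cursor_del_by_state(struct cursor_pos *pos, struct res *r)
{
	if (!r->evictable)
		return;		/* cursor keeps pointing at r */
	cursor_unlink(pos, r);
}

/* Policy sketched for patch 1: decide by recorded membership. */
static void cursor_del_by_membership(struct cursor_pos *pos, struct res *r)
{
	if (!r->on_cursor)
		return;
	cursor_unlink(pos, r);
}

/* True if the cursor still references r at either end. */
static bool cursor_references(const struct cursor_pos *pos,
			      const struct res *r)
{
	return pos->first == r || pos->last == r;
}
```

In this model, adding a resource, flipping it to unevictable and calling
cursor_del_by_state() leaves cursor_references() true, which is the dangling
state; cursor_del_by_membership() clears it because the decision no longer
depends on a property that changed after the add.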

Validating the bug (no GPU required, patch 2):

- ttm_bo_bulk_move_swapped_free_dangles allocates a resource on a
  bulk_move cursor, swaps out its bo's ttm so the resource becomes
  unevictable, frees it, and asserts the cursor no longer references the
  freed resource. Without the fix this fails: pos->first/pos->last still
  equal the freed pointer.

- ttm_bo_bulk_move_dangling_corrupts then allocates on the same
  bulk_move; without the fix, KASAN reports a slab-use-after-free in
  ttm_resource_add_bulk_move().

On the affected machine, I tested with a throwaway debug kernel that
WARN_ONCE()s when ttm_resource_del_bulk_move() skips an unevictable
resource still on the cursor. It fired during a normal hibernate/resume
cycle, via
amdgpu_gem_object_close() -> amdgpu_vm_bo_del() -> ttm_bo_set_bulk_move().
That confirmed the production trigger and that the planted state matches
the field crash (the ttm_resource.c WARN_ON, then list_del corruption).

With patch 1 applied, both kunit tests pass and the full TTM kunit suite is
green (no KASAN report, no CONFIG_DEBUG_LIST splat). I also built a kernel
with the fix plus KASAN, CONFIG_DEBUG_LIST and lockdep, and ran 10
hibernate/resume/GEM-close cycles under GPU load with no use-after-free or
LRU list corruption.

I am new to this area, so review of the approach is very welcome -- in
particular whether tracking membership on the resource is preferable to
removing it from the cursor at the point it becomes unevictable.

Samuel Ainsworth (2):
  drm/ttm: don't leave bulk_move cursor dangling for unevictable
    resources
  drm/ttm/tests: add bulk_move cursor regression tests

 drivers/gpu/drm/ttm/tests/ttm_bo_test.c | 163 ++++++++++++++++++++++++
 drivers/gpu/drm/ttm/ttm_resource.c      |  18 ++-
 include/drm/ttm/ttm_resource.h          |   9 ++
 3 files changed, 187 insertions(+), 3 deletions(-)


base-commit: 2c7d5b0a5ec0fc713a7f350806553643e87e6f43
--
2.54.0