Re: [PATCH v1 0/2] drm/ttm: fix bulk_move cursor use-after-free for unevictable resources
From: Samuel Ainsworth
Date: Mon Jun 15 2026 - 23:42:28 EST

Welp, I spoke too soon. v1 is broken.

While stress-testing the fix on the affected machine (KASAN + DEBUG_LIST,
ordinary desktop use rather than only hibernate cycles), I hit a new
use-after-free that this patch introduces:

  BUG: KASAN: slab-use-after-free in ttm_lru_bulk_move_tail+0xa93/0xcb0 [ttm]
  Read of size 8 ... by task .swayosd-server
    ttm_lru_bulk_move_tail+0xa93/0xcb0 [ttm]
    amdgpu_vm_move_to_lru_tail+0x29/0x40 [amdgpu]
    amdgpu_cs_ioctl+0x4247/0x5090 [amdgpu]
  Freed by:
    ttm_resource_free+0x1f8/0x390 [ttm]
    ttm_bo_delayed_delete+0x69/0x100 [ttm]

v1 drops the ttm_resource_unevictable() gate from
ttm_resource_del_bulk_move() and removes any tracked resource, so the del now
also runs for resources that have become unevictable.
ttm_lru_bulk_move_del() updates pos->first/pos->last by walking the manager
LRU (ttm_lru_next_res()/ttm_lru_prev_res()), but an unevictable resource's
lru.link has already been moved to bdev->unevictable, so the walk lands on an
unrelated resource and leaves the cursor pointing outside the bulk-move range.
That unrelated resource is freed later, and the next ttm_lru_bulk_move_tail()
dereferences the dangling cursor.

IIUC the resource needs to be taken off the cursor at the point it becomes
unevictable (while its link is still on the manager LRU), not deferred to
free/teardown. The kunit tests passed only because they used a single-element
cursor, where ttm_lru_bulk_move_del() clears pos without walking; they did
not cover the multi-element / moved-link case that this UAF requires.

I am not 100% convinced I have this pinned down yet, but I believe I may be
making progress. Feedback welcome!

I'll try to follow up with a v2 that fixes it at the unevictable transition and
adds tests that actually exercise the multi-element case. Apologies for the
confusion.

Best,
Sam

On Tue, Jun 16, 2026 at 12:46 AM Samuel Ainsworth <skainsworth@xxxxxxxxx> wrote:
> On Mon, Jun 15, 2026 at 11:49 PM Samuel Ainsworth <skainsworth@xxxxxxxxx> wrote:
>>
>> A resource added to a bo's bulk_move LRU cursor can become unevictable
>> (pinned or swapped) after it has been added. ttm_resource_del_bulk_move()
>> then skips removing it -- both on free and on ttm_bo_set_bulk_move() during
>> bo teardown -- because the resource is unevictable, leaving the cursor's
>> pos->first/pos->last pointing at it. Once the resource is freed, the next
>> allocation on that bulk_move dereferences the dangling cursor
>> (use-after-free) and corrupts the LRU list, which CONFIG_DEBUG_LIST turns
>> into a fatal BUG.
>>
>> In the field this is a hibernation-triggered panic on a Framework 13 (AMD
>> Ryzen 7040): a buffer swapped out during hibernate is closed after resume
>> (amdgpu_gem_object_close -> amdgpu_vm_bo_del -> ttm_bo_set_bulk_move()),
>> which leaves its unevictable resource on the VM's bulk_move cursor; a later
>> GEM allocation on that cursor then faults (drm/amd issue #5387).
>>
>> Patch 1 tracks cursor membership explicitly so the del always undoes the
>> add, regardless of any pin/swap transition. Patch 2 adds kunit regression
>> coverage.
>>
>> Validating the bug (no GPU required, patch 2):
>>
>>   - ttm_bo_bulk_move_swapped_free_dangles allocates a resource on a
>>     bulk_move cursor, swaps out its bo's ttm so the resource becomes
>>     unevictable, frees it, and asserts the cursor no longer references the
>>     freed resource. Without the fix this fails: pos->first/pos->last still
>>     equal the freed pointer.
>>
>>   - ttm_bo_bulk_move_dangling_corrupts then allocates on the same
>>     bulk_move; without the fix, KASAN reports a slab-use-after-free in
>>     ttm_resource_add_bulk_move().
>>
>> On the affected machine, I tested with a throwaway debug kernel that
>> WARN_ONCE()s when ttm_resource_del_bulk_move() skips an unevictable
>> resource still on the cursor. It fired during a normal hibernate/resume
>> cycle, via
>> amdgpu_gem_object_close() -> amdgpu_vm_bo_del() -> ttm_bo_set_bulk_move().
>> That confirmed the production trigger and that the planted state matches
>> the field crash (the ttm_resource.c WARN_ON, then list_del corruption).
>>
>> With the patch 1 fix, both kunit tests pass and the full TTM kunit suite is
>> green (no KASAN report, no CONFIG_DEBUG_LIST splat). Furthermore, I built a
>> kernel with the fix and KASAN + CONFIG_DEBUG_LIST + lockdep enabled, and ran
>> 10 hibernate/resume/GEM-close cycles under GPU load with no use-after-free
>> or LRU list corruption.
>>
>> I am new to this area, so review of the approach is very welcome -- in
>> particular whether tracking membership on the resource is preferable to
>> removing it from the cursor at the point it becomes unevictable.
>>
>> Samuel Ainsworth (2):
>>   drm/ttm: don't leave bulk_move cursor dangling for unevictable
>>     resources
>>   drm/ttm/tests: add bulk_move cursor regression tests
>>
>>  drivers/gpu/drm/ttm/tests/ttm_bo_test.c | 163 ++++++++++++++++++++++++
>>  drivers/gpu/drm/ttm/ttm_resource.c      |  18 ++-
>>  include/drm/ttm/ttm_resource.h          |   9 ++
>>  3 files changed, 187 insertions(+), 3 deletions(-)
>>
>>
>> base-commit: 2c7d5b0a5ec0fc713a7f350806553643e87e6f43
>> --
>> 2.54.0