Re: [PATCH] drm/nouveau: Document weird looking bugfix

From: Tvrtko Ursulin

Date: Wed Jun 10 2026 - 08:16:48 EST



On 10/06/2026 09:26, Philipp Stanner wrote:
> commit c8a5d5ea3ba6 ("nouveau: fix client work fence deletion race")
> fixed a race. To do so, it replaced the automatically locking
> dma_fence_is_signaled() with manual locks plus
> dma_fence_is_signaled_locked().
>
> For someone browsing through the code, this reads very much like a
> cleanup or rework leftover. Future contributors and / or new maintainers
> not familiar with the history might be tempted to remove that bugfix.
>
> Document the bugfix.
>
> Signed-off-by: Philipp Stanner <phasta@xxxxxxxxxx>
> ---
> (I did not test this)
> ---
>  drivers/gpu/drm/nouveau/nouveau_drm.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
> index 42a81166f3a9..519a0c164a72 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> @@ -159,6 +159,13 @@ nouveau_cli_work_ready(struct dma_fence *fence)
>  	unsigned long flags;
>  	bool ret = true;
> +	/*
> +	 * This is not a cleanup / rework leftover, but a bugfix to prevent a
> +	 * race with someone signalling the fence. The locked
> +	 * dma_fence_is_signaled() cannot be used. The dma_fence implementation
> +	 * is not fully synchronized with locks, but also uses atomic bits,
> +	 * which can cause the dma_fence_put() below to be executed too soon.
> +	 */

IMHO it would also be interesting to document why this happens from the nouveau point of view.

For example, I see two references held on this fence in the call chain, but apparently neither is enough to close the race. That suggests a third party holds a pointer to this fence without holding a reference.

I am talking about this path:

nouveau_gem_object_unmap -> nouveau_cli_work_queue

There it grabs a reference before queuing the worker. In the worker, it drops that reference before calling the callback that nouveau_gem_object_unmap installed:

static void
nouveau_cli_work(struct work_struct *w)
{
	struct nouveau_cli *cli = container_of(w, typeof(*cli), work);
	struct nouveau_cli_work *work, *wtmp;
	mutex_lock(&cli->lock);
	list_for_each_entry_safe(work, wtmp, &cli->worker, head) {
		if (!work->fence || nouveau_cli_work_ready(work->fence)) {

... nouveau_cli_work_ready can drop one reference

			list_del(&work->head);
			work->func(work);

... then work->func, which nouveau_gem_object_unmap set to nouveau_gem_object_delete_work, will end up calling:

nouveau_gem_object_delete -> nouveau_fence_unref

On possibly the same fence.

So if there is a path inside nouveau itself which signals the fence without holding a reference, could it be that the problem is self-inflicted and not due to a dma-fence quirk?

I am not entirely sure since it is not very clear. It needs someone with nouveau expertise to clarify.
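
To make the hypothesis above concrete, here is a minimal userspace sketch in C. It is a toy model only, not nouveau or dma-fence code: struct toy_fence, toy_fence_get(), toy_fence_put() and toy_signaller_outlives_fence() are hypothetical names made up for illustration, and the whole thing assumes there really is a signaller that holds only a bare pointer, which is exactly the open question. It just counts references along the chain described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a fence: a plain counter plus flags (hypothetical). */
struct toy_fence {
	int refcount;
	bool signaled;
	bool freed;	/* set instead of freeing, so the state can be inspected */
};

static void toy_fence_get(struct toy_fence *f)
{
	assert(!f->freed);
	f->refcount++;
}

static void toy_fence_put(struct toy_fence *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->freed = true;
}

/*
 * Returns true if the fence is already released at a point where the
 * assumed signaller has set the signalled state but may still touch it.
 */
static bool toy_signaller_outlives_fence(bool signaller_holds_ref)
{
	struct toy_fence f = { .refcount = 1 };	/* ref owned by the object */
	bool dangling;

	toy_fence_get(&f);		/* queue path: ref taken for the worker */
	if (signaller_holds_ref)
		toy_fence_get(&f);	/* the ref the hypothesis says is missing */

	f.signaled = true;		/* signaller is mid-way through signalling */

	toy_fence_put(&f);		/* worker: ready check drops the queue ref */
	toy_fence_put(&f);		/* worker: callback drops the object's ref */

	dangling = f.freed;		/* signaller still has its pointer here */

	if (signaller_holds_ref)
		toy_fence_put(&f);
	return dangling;
}
```

In this model, toy_signaller_outlives_fence(false) reports a released fence while the signaller still has its pointer, and toy_signaller_outlives_fence(true) does not. That says nothing about whether nouveau really has such a path; it only shows why the two references in the call chain would not be enough if it does.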

Regards,

Tvrtko

>  	dma_fence_lock_irqsave(fence, flags);
>  	if (!dma_fence_is_signaled_locked(fence))
>  		ret = false;