Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking

From: Boris Brezillon

Date: Mon Jun 08 2026 - 12:23:45 EST

On Mon, 08 Jun 2026 17:30:58 +0200
Philipp Stanner <phasta@xxxxxxxxxxx> wrote:

> On Mon, 2026-06-08 at 17:23 +0200, Danilo Krummrich wrote:
> > On Mon Jun 8, 2026 at 5:17 PM CEST, Philipp Stanner wrote:
> > > On Mon, 2026-06-08 at 17:01 +0200, Boris Brezillon wrote:
> > > > On Mon, 8 Jun 2026 16:24:37 +0200
> > > > Philipp Stanner <phasta@xxxxxxxxxx> wrote:
> > > >
> > > > > @@ -1020,11 +1024,20 @@ EXPORT_SYMBOL(dma_fence_wait_any_timeout);
> > > > > void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline)
> > > > > {
> > > > > const struct dma_fence_ops *ops;
> > > > > + unsigned long flags;
> > > > >
> > > > > rcu_read_lock();
> > > > > ops = rcu_dereference(fence->ops);
> > > > > - if (ops && ops->set_deadline && !dma_fence_is_signaled(fence))
> > > > > + if (!ops || !ops->set_deadline) {
> > > > > + rcu_read_unlock();
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + dma_fence_lock_irqsave(fence, flags);
> > > > > + if (!dma_fence_is_signaled_locked(fence))
> > > > > ops->set_deadline(fence, deadline);
> > > >
> > > > You can't take the fence lock around ->set_deadline(), otherwise you'll
> > > > deadlock here [1] or here [2].
> > > >
> > > > > +
> > > > > + dma_fence_unlock_irqrestore(fence, flags);
> > > > > rcu_read_unlock();
> > > > > }
> > > >
> > > >
> > > > [1]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/dma-buf/sw_sync.c#L182
> > > > [2]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/gpu/drm/msm/msm_fence.c#L139
>
> Oh, MSM actually doesn't btw, that's a false positive. That's a
> distinct spinlock on their fence context object.

It's not, it's the same lock they attach to all their fences coming
from this context. It's just that this lock appears to be per-context,
like is the case for basically every driver, since the inline lock was
introduced only in this release cycle.

Anyway, this is all stuff we can fix if people think it's okay to
protect dma_fence_ops calls with the fence lock. But my point remains:
each op has its own locking-rules, some are called with the fence lock
held (enable_signaling(), signaled()), others are not (set_deadline(),
get_xxx_name()), so we need to carefully audit each of those to make
sure:

- calling with the lock held in the new places is not causing a
deadlock
- the returned data, if not a scalar, is protected by the RCU read lock
- any driver implementing ops that can be called without the lock held
need to hold on the device data for an RCU grace period

The last bullet is probably the one I'm the most worried about, because
instead of a single rule that applies to all ops, we have various cases
based on whether some ops are implemented or not, but that's already
the case with deprecated ops like .release() or .wait(), so maybe
that's okay with the proper doc.

If I were to choose, I'd probably go for a dedicated rwlock_t to
protect dma_fence_ops, so we can:

- protect all dma_buf_ops::xx() consistently no matter the kind of op
- protect returned data (get_xxx_name()) with this lock instead of the
RCU read lock

But the overhead of this extra lock might not be acceptable, dunno.

>
>
> But yes, before we could upstream this, we would go through all the
> implementors like Danilo did, to find all the others.

There's the two I pointed out, plus the array/chain containers I
mentioned, which are not problematic.