Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking

From: Christian König

Date: Tue Jun 09 2026 - 04:09:46 EST

On 6/8/26 18:16, Boris Brezillon wrote:
> On Mon, 08 Jun 2026 17:30:58 +0200
> Philipp Stanner <phasta@xxxxxxxxxxx> wrote:
>
>> On Mon, 2026-06-08 at 17:23 +0200, Danilo Krummrich wrote:
>>> On Mon Jun 8, 2026 at 5:17 PM CEST, Philipp Stanner wrote:
>>>> On Mon, 2026-06-08 at 17:01 +0200, Boris Brezillon wrote:
>>>>> On Mon, 8 Jun 2026 16:24:37 +0200
>>>>> Philipp Stanner <phasta@xxxxxxxxxx> wrote:
>>>>>
>>>>>> @@ -1020,11 +1024,20 @@ EXPORT_SYMBOL(dma_fence_wait_any_timeout);
>>>>>> void dma_fence_set_deadline(struct dma_fence *fence, ktime_t deadline)
>>>>>> {
>>>>>> const struct dma_fence_ops *ops;
>>>>>> + unsigned long flags;
>>>>>>
>>>>>> rcu_read_lock();
>>>>>> ops = rcu_dereference(fence->ops);
>>>>>> - if (ops && ops->set_deadline && !dma_fence_is_signaled(fence))
>>>>>> + if (!ops || !ops->set_deadline) {
>>>>>> + rcu_read_unlock();
>>>>>> + return;
>>>>>> + }
>>>>>> +
>>>>>> + dma_fence_lock_irqsave(fence, flags);
>>>>>> + if (!dma_fence_is_signaled_locked(fence))
>>>>>> ops->set_deadline(fence, deadline);
>>>>>
>>>>> You can't take the fence lock around ->set_deadline(), otherwise you'll
>>>>> deadlock here [1] or here [2].
>>>>>
>>>>>> +
>>>>>> + dma_fence_unlock_irqrestore(fence, flags);
>>>>>> rcu_read_unlock();
>>>>>> }
>>>>>
>>>>>
>>>>> [1]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/dma-buf/sw_sync.c#L182
>>>>> [2]https://elixir.bootlin.com/linux/v7.0.11/source/drivers/gpu/drm/msm/msm_fence.c#L139
>>
>> Oh, MSM actually doesn't btw, that's a false positive. That's a
>> distinct spinlock on their fence context object.
>
> It's not, it's the same lock they attach to all their fences coming
> from this context. It's just that this lock appears to be per-context,
> like is the case for basically every driver, since the inline lock was
> introduced only in this release cycle.
>
> Anyway, this is all stuff we can fix if people think it's okay to
> protect dma_fence_ops calls with the fence lock. But my point remains:
> each op has its own locking-rules, some are called with the fence lock
> held (enable_signaling(), signaled()), others are not (set_deadline(),
> get_xxx_name()), so we need to carefully audit each of those to make
> sure:
>
> - calling with the lock held in the new places is not causing a
> deadlock

Yeah, exactly that.

For example the set_deadline() is intentionally not called with the fence lock held because that won't work for some use cases.

The problem when you call ops with a lock held is always that this lock then becomes the outermost look held. In other words when you for example want to grab a power management lock to implement the deadline feature the framework enforces an order between the two locks which isn't desired.

Regards,
Christian.

> - the returned data, if not a scalar, is protected by the RCU read lock
> - any driver implementing ops that can be called without the lock held
> need to hold on the device data for an RCU grace period
>
> The last bullet is probably the one I'm the most worried about, because
> instead of a single rule that applies to all ops, we have various cases
> based on whether some ops are implemented or not, but that's already
> the case with deprecated ops like .release() or .wait(), so maybe
> that's okay with the proper doc.
>
> If I were to choose, I'd probably go for a dedicated rwlock_t to
> protect dma_fence_ops, so we can:
>
> - protect all dma_buf_ops::xx() consistently no matter the kind of op
> - protect returned data (get_xxx_name()) with this lock instead of the
> RCU read lock
>
> But the overhead of this extra lock might not be acceptable, dunno.
>
>>
>>
>> But yes, before we could upstream this, we would go through all the
>> implementors like Danilo did, to find all the others.
>
> There's the two I pointed out, plus the array/chain containers I
> mentioned, which are not problematic.