Re: Properly synchronize dma_fence->signaled bit (Was: Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking)
From: Christian König
Date: Mon Jun 15 2026 - 05:58:44 EST
On 6/15/26 10:29, Philipp Stanner wrote:
> On Mon, 2026-06-08 at 20:32 +0200, Christian König wrote:
>> On 6/8/26 19:59, Danilo Krummrich wrote:
>>> On Mon Jun 8, 2026 at 7:34 PM CEST, Christian König wrote:
>>>> That's why we need the RCU grace period to make sure that nobody
>>>> is
>>>> referencing the driver stuff any more.
>>>
>>> Right, and that's what Philipp tries to address, the requirement to
>>> wait for an
>>> RCU grace period is perfectly fine if it is only about freeing
>>> memory, but it
>>> can become painful if the fence private data contains data also
>>> needs to be
>>> destructed in some way.
>>
>> Yeah that makes sense.
>>
>>> IOW, if a driver signals a fence, it is lifecycle-wise reasonable
>>> to destruct
>>> the private data that is no longer needed (remaining users only
>>> deal with struct
>>> dma_fence) and having to wait for a full grace period adds sublety
>>> and
>>> complication that can be avoided with the proposed approach.
>>
>> Yeah, I've run into that when I tried to make the amdgpu fences
>> independent as well.
>>> That said, I'd like to ask the opposite question: What are the
>>> concerns with the
>>> proposed approach over (pure) RCU?
>>
>> For example the reason why we have the dma_fence_is_signaled() and
>> dma_fence_is_signaled_locked() variants is because there is a
>> measurable difference in some specific use cases for not grabbing the
>> locks.
>
> Yeah, certainly, not taking locks makes your code go faster. drm_sched
> can sing a song about that -.-
>
>>
>> I personally find those micro-optimizations rather questionable, but
>> the community agreement is that we should have them.
>
> Yeah so this is most definitely broken and needs to be removed. Proof
> is that various parties already need to work around that issue. You
> just never know what drivers do. It's very conceivable that the Nouveau
> case will often happen:
>
> if (dma_fence_is_signaled(f))
> dma_fence_put(f);
>
>
> This fast path check in my mind certainly breaks the intended dma_fence
> design:
>
> void dma_fence_signal_timestamp_locked(struct dma_fence *fence,
> ktime_t timestamp)
> {
> const struct dma_fence_ops *ops;
> struct dma_fence_cb *cur, *tmp;
> struct list_head cb_list;
>
> dma_fence_assert_held(fence);
>
> if (unlikely(test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
> &fence->flags)))
> return;
>
>
> Modifying the bit is consistently done under lock protection, so
> reading must be done, too.
>
> Do you remember who wanted those fast path checks? Who is spinning on
> that lock?
Simona Vetter and basically the rest of the community.
And I can clearly say even if I don't like them that those optimizations are a must have.
> In any case, that needs to be repaired.
No, see my discussion with Simona on the mailing list. I need to dig that up as well, but it was around the time I added the same workaround to amdgpu.
You are basically trying what I have been suggesting as well, but there is a very wide agreement that the current design is a must have.
Regards,
Christian.
> A hacky lockless way might
> hypothetically be doable by setting barriers, as I suggest here:
>
> https://lore.kernel.org/dri-devel/20260612104251.2264707-2-phasta@xxxxxxxxxx/
>
> But note that the ops-decoupling in e.g. dma_fence_timeline_name()
> already depends on lockless ordering mechanisms in
> dma_fence_signal_timestamp_locked(). So we're already a bit fragile
> here.
>
>>
>> So my take would rather be that the dma_fence_is_signaled_locked()
>> variant goes away and we consistently call the ops pointers without
>> holding the dma_fence lock and the driver implementations can then
>> optionally take it if necessary.
>
> That might work.
>
>>
>> I think for this we would just need to replace most calls to
>> dma_fence_is_signaled_locked() with dma_fence_test_signaled().
>
> That would not fix the cleanup race in Nouveau.
>
> I do get the idealistic idea of fence->signaled really just
> representing whether the hardware is done, without any further
> guarantees, but having this fast-path lockless magic in functions
> checking the signaled state is really asking for trouble.
>
> So it would seem that both dma_fence_is_signaled() and
> dma_fence_test_signaled_flag() need to be properly synchronized.
>
>
> P.
>
>>
>> In the long term that would also allow cleaning up the container
>> handling and simplifying the DRM scheduler a bit.
>>
>> Regards,
>> Christian.