Re: Properly synchronize dma_fence->signaled bit (Was: Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking)

From: Christian König

Date: Wed Jun 17 2026 - 05:57:53 EST


On 6/16/26 13:25, Philipp Stanner wrote:
> On Mon, 2026-06-15 at 11:57 +0200, Christian König wrote:
>> On 6/15/26 10:29, Philipp Stanner wrote:
>>>
>>> This fast path check in my mind certainly breaks the intended dma_fence
>>> design:
>>>
>>> void dma_fence_signal_timestamp_locked(struct dma_fence *fence,
>>>       ktime_t timestamp)
>>> {
>>> const struct dma_fence_ops *ops;
>>> struct dma_fence_cb *cur, *tmp;
>>> struct list_head cb_list;
>>>
>>> dma_fence_assert_held(fence);
>>>
>>> if (unlikely(test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
>>>       &fence->flags)))
>>> return;
>>>
>>>
>>> Modifying the bit is consistently done under lock protection, so
>>> reading must be done, too.
>>>
>>> Do you remember who wanted those fast path checks? Who is spinning on
>>> that lock?
>>
>> Simona Vetter and basically the rest of the community.
>
> I did some git-blame and it would seem to me that this fast-path hack
> was added in
>
> e941759c74a4 fence: dma-buf cross-device synchronization (v18)
>
> in 2014 A.D.
>
> So it was there from the very beginning and was not added because there
> was a performance bottleneck later. It is conceivable that a
> performance issue was present from the get go, of course.

Not saying that it wasn't there from the beginning, but multiple people (including me) have tried to improve the situation and that was either immediately reverted or directly rejected.

>>
>> And I can clearly say even if I don't like them that those
>> optimizations are a must have.
>>
>>> In any case, that needs to be repaired.
>>
>> No, see my discussion with Simona on the mailing list. I need to dig
>> that up as well, but it was around the time I added the same
>> workaround to amdgpu.
>>
>> You are basically trying what I have been suggesting as well, but
>> there is a very wide agreement that the current design is a must
>> have.
>
> I suggest at least three things:
>
>
> A.
>
> Very explicitly document all lockless mechanisms and their
> justification in both code comments and commit messages.
>
> * We need to document lockless magic *drastically* better in DRM. I
> see code left and right where there is some barrier with the comment
> simply being "so list_empty() works without a lock".
> * The commit message needs to justify why a lock is missing, why this
> is the preferred solution, why it is correct. The latter also needs
> to be in a code comment.
> * Note that WRITE_ONCE() is not only about volatile, but also about
> "watch out, here is a lockless access!", as Linus pointed out
> repeatedly.

Completely agree. Question is who has time for that?

> B.
>
> I think rejecting ideas with "we tried this, it >>didn't work<<" is not
> a valid reason for refusing an idea. Point A above helps with that. If
> your commit message contains measurements or links to tickets with
> *real life* performance regressions (microbenchmarks are invalid), that
> helps reducing discussion overhead drastically.
>
> Now, in this particular case, I fail to see how taking the spinlock to
> check that bit is evil. If it regresses someone's speed that much, it
> would mean that someone is heavily punching that lock, like polling
> 24/7 with dma_fence_is_signaled().

I think (but I'm not 100% sure) the the problem is that taking the spinlock introduces a write to the cache line it is in.

At the moment when a fence is signaled a read is enough to check that state, so what happens is that the cache line for the signaled bit sooner or later end up in all CPU caches.

When you start to use the spinlock the cache line backing that plays ping/pong between all the CPU cores and that is something which always stalls each CPU when it needs to acquire the cache line. Keep in mind that on a modern box you can calculate like a 4x4 matrix in the same time you solve a cache miss.

This is especially important for the stub fence which is used by basically all cores at the same time whenever you need a signaled dummy.

> Again, having that use case documented somewhere could save us all time
> – especially for you, Christian, since you wouldn't be forced to have
> the same discussion over and over again over the years ;-)

Well I could also send out all the DMA-buf resilient patches/ideas I came up with over the years once more.

I think you will potentially agree to some of them immediately.

> C.
> Robustness and correctness always trump performance. They especially
> trump microbenchmarks.

Yeah agree completely as well. We should also in general not optimize for unrealistic use cases.

For example I had to reject multiple attempts to optimize the dma_fence_chain container for something which only happens in test cases.

Regards,
Christian.

>
>
> P.