Re: Properly synchronize dma_fence->signaled bit (Was: Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking)

From: Christian König

Date: Wed Jun 17 2026 - 09:04:30 EST


On 6/17/26 12:16, Philipp Stanner wrote:
> On Wed, 2026-06-17 at 11:46 +0200, Christian König wrote:
...
>>
>>> B.
>>>
>>> I think rejecting ideas with "we tried this, it >>didn't work<<" is not
>>> a valid reason for refusing an idea. Point A above helps with that. If
>>> your commit message contains measurements or links to tickets with
>>> *real life* performance regressions (microbenchmarks are invalid), that
>>> helps reduce discussion overhead drastically.
>>>
>>> Now, in this particular case, I fail to see how taking the spinlock to
>>> check that bit is evil. If it regresses someone's speed that much, it
>>> would mean that someone is heavily punching that lock, like polling
>>> 24/7 with dma_fence_is_signaled().
>>
>> I think (but I'm not 100% sure) the problem is that taking the
>> spinlock introduces a write to the cache line it is in.
>>
>> At the moment when a fence is signaled a read is enough to check that
>> state, so what happens is that the cache line for the signaled bit
>> sooner or later ends up in all CPU caches.
>>
>> When you start to use the spinlock the cache line backing that plays
>> ping/pong between all the CPU cores and that is something which
>> always stalls each CPU when it needs to acquire the cache line. Keep
>> in mind that on a modern box you can calculate like a 4x4 matrix in
>> the same time you solve a cache miss.
>>
>> This is especially important for the stub fence which is used by
>> basically all cores at the same time whenever you need a signaled
>> dummy.
>
> Alright, that sort of sounds logical, I guess. So the argument
> basically is that if we'd try to lock that, someone would immediately
> report real and massive performance regressions leading to a revert.
>
> I think last time you mentioned that memory footprint is less of a
> concern for dma_fence than cache lines. Out of interest: has anyone
> ever experimented with more padding to prevent spinners from shooting
> down other CPUs' cache lines?

How would that work in this case? As long as every CPU takes the same spinlock, they all contend for the cache line that holds it, no matter how you pad the structure around it.
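
To illustrate, here is a rough userspace C11 sketch, not kernel code. The names (toy_fence, toy_is_signaled_lockless, toy_is_signaled_locked) and the 64-byte cache line size are made up for illustration. Padding moves the flags onto a different cache line than the lock, but the locked check still has to write to the lock's own line on every call, while the lockless check only reads.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size, not queried from hardware */

/* Hypothetical stand-in for a fence; NOT struct dma_fence. */
struct toy_fence {
	atomic_flag lock;                       /* stand-in for the spinlock */
	alignas(CACHE_LINE) atomic_ulong flags; /* padded onto its own line  */
};

/* Padding works as intended: lock and flags sit on different lines. */
static_assert(offsetof(struct toy_fence, lock) / CACHE_LINE !=
	      offsetof(struct toy_fence, flags) / CACHE_LINE,
	      "lock and flags share a cache line");

static size_t line_of(size_t offset)
{
	return offset / CACHE_LINE;
}

static void toy_fence_init(struct toy_fence *f)
{
	atomic_flag_clear(&f->lock);
	atomic_init(&f->flags, 0UL);
}

/* Pure load: the flags line can stay shared in every CPU's cache. */
static int toy_is_signaled_lockless(struct toy_fence *f)
{
	return (atomic_load_explicit(&f->flags, memory_order_acquire) & 1UL) != 0;
}

/*
 * Taking the lock is an atomic read-modify-write on the lock word, so
 * each caller needs exclusive ownership of the lock's cache line,
 * regardless of where the flags were padded to.
 */
static int toy_is_signaled_locked(struct toy_fence *f)
{
	int signaled;

	while (atomic_flag_test_and_set_explicit(&f->lock, memory_order_acquire))
		; /* spin until the lock is free */
	signaled = (atomic_load_explicit(&f->flags, memory_order_relaxed) & 1UL) != 0;
	atomic_flag_clear_explicit(&f->lock, memory_order_release);
	return signaled;
}
```

Both checks return the same answer; the difference is only in which cache lines get written on the way there.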

> Since you're the maintainer of dma-buf, what would you wish we do?

Try to improve the documentation by sending out patches. I will send out my ideas for resilience improvements, and we can then discuss them on the patches.

> Would you be at least OK with the memory barrier approach to make the
> API a bit more robust? AFAIU the barriers will not cause a cache line
> invalidation.

What exactly do you mean by that? The test_bit() and set_bit() are already memory barriers as far as I know.
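
In case the idea is release/acquire pairing on the signaled bit, a userspace C11 sketch of that could look like the following. This is only my guess at what is meant; the names (toy_sig_fence, toy_signal, toy_is_signaled) are made up and this is not the dma_fence API. The reader side only performs loads, so it neither takes a lock nor writes to the cache line.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_SIGNALED_BIT 0x1UL

/* Hypothetical stand-in for a fence; NOT struct dma_fence. */
struct toy_sig_fence {
	int error;          /* payload published before the bit is set */
	atomic_ulong flags; /* bit 0: signaled */
};

/* Signaler: write the payload, then set the bit with release ordering. */
static void toy_signal(struct toy_sig_fence *f, int error)
{
	f->error = error;
	atomic_fetch_or_explicit(&f->flags, TOY_SIGNALED_BIT,
				 memory_order_release);
}

/*
 * Checker: load the bit with acquire ordering. If the bit is seen as
 * set, the earlier write to f->error is guaranteed to be visible too.
 * Returns 1 and fills *error when signaled, 0 otherwise.
 */
static int toy_is_signaled(struct toy_sig_fence *f, int *error)
{
	unsigned long v;

	assert(error != NULL);
	v = atomic_load_explicit(&f->flags, memory_order_acquire);
	if (!(v & TOY_SIGNALED_BIT))
		return 0;
	*error = f->error;
	return 1;
}
```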

>>
>>> Again, having that use case documented somewhere could save us all time
>>> – especially for you, Christian, since you wouldn't be forced to have
>>> the same discussion over and over again over the years ;-)
>>
>> Well, I could also send out all the DMA-buf resilience patches/ideas I came up with over the years once more.
>
> Maybe we could have sort of a wiki in Documentation/ with links to
> relevant mail threads and some explanations of why things are the way
> they are?

I think some AI analyzing the mailing list and noting when some ideas repeat would help.

At least for me, maintaining some kind of wiki in addition to my current workload wouldn't be possible at all.

> btw, is there a dma-buf TODO list like for DRM in general?

No, not that I know of. We used to have minor cleanup tasks on the DRM TODOs, but those were already taken by somebody.

Regards,
Christian.

>
> There are many passionate hackers who love challenges. We could
> certainly add a few "Difficulty: hard" entries for a few controversial
> potential reworks.
>
>
> P.