Properly synchronize dma_fence->signaled bit (Was: Re: [RFC PATCH] dma-fence: Fix races of fence callbacks versus destructors by locking)

From: Philipp Stanner

Date: Mon Jun 15 2026 - 04:29:45 EST

On Mon, 2026-06-08 at 20:32 +0200, Christian König wrote:
> On 6/8/26 19:59, Danilo Krummrich wrote:
> > On Mon Jun 8, 2026 at 7:34 PM CEST, Christian König wrote:
> > > That's why we need the RCU grace period to make sure that nobody
> > > is
> > > referencing the driver stuff any more.
> >
> > Right, and that's what Philipp tries to address, the requirement to
> > wait for an
> > RCU grace period is perfectly fine if it is only about freeing
> > memory, but it
> > can become painful if the fence private data contains data also
> > needs to be
> > destructed in some way.
>
> Yeah that makes sense.
>
> > IOW, if a driver signals a fence, it is lifecycle-wise reasonable
> > to destruct
> > the private data that is no longer needed (remaining users only
> > deal with struct
> > dma_fence) and having to wait for a full grace period adds sublety
> > and
> > complication that can be avoided with the proposed approach.
>
> Yeah, I've run into that when I tried to make the amdgpu fences
> independent as well.
> > That said, I'd like to ask the opposite question: What are the
> > concerns with the
> > proposed approach over (pure) RCU?
>
> For example the reason why we have the dma_fence_is_signaled() and
> dma_fence_is_signaled_locked() variants is because there is a
> measurable difference in some specific use cases for not grabbing the
> locks.

Yeah, certainly, not taking locks makes your code go faster. drm_sched
can sing a song about that -.-

>
> I personally find those micro-optimizations rather questionable, but
> the community agreement is that we should have them.

Yeah so this is most definitely broken and needs to be removed. Proof
is that various parties already need to work around that issue. You
just never know what drivers do. It's very conceivable that the Nouveau
case will often happen:

if (dma_fence_is_signaled(f))
dma_fence_put(f);

This fast path check in my mind certainly breaks the intended dma_fence
design:

void dma_fence_signal_timestamp_locked(struct dma_fence *fence,
ktime_t timestamp)
{
const struct dma_fence_ops *ops;
struct dma_fence_cb *cur, *tmp;
struct list_head cb_list;

dma_fence_assert_held(fence);

if (unlikely(test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
&fence->flags)))
return;

Modifying the bit is consistently done under lock protection, so
reading must be done, too.

Do you remember who wanted those fast path checks? Who is spinning on
that lock?

In any case, that needs to be repaired. A hacky lockless way might
hypothetically be doable by setting barriers, as I suggest here:

https://lore.kernel.org/dri-devel/20260612104251.2264707-2-phasta@xxxxxxxxxx/

But note that the ops-decoupling in e.g. dma_fence_timeline_name()
already depends on lockless ordering mechanisms in
dma_fence_signal_timestamp_locked(). So we're already a bit fragile
here.

>
> So my take would rather be that the dma_fence_is_signaled_locked()
> variant goes away and we consistently call the ops pointers without
> holding the dma_fence lock and the driver implementations can then
> optionally take it if necessary.

That might work.

>
> I think for this we would just need to replace most calls to
> dma_fence_is_signaled_locked() with dma_fence_test_signaled().

That would not fix the cleanup race in Nouveau.

I do get the idealistic idea of fence->signaled really just
representing whether the hardware is done, without any further
guarantees, but having this fast-path lockless magic in functions
checking the signaled state is really asking for trouble.

So it would seem that both dma_fence_is_signaled() and
dma_fence_test_signaled_flag() need to be properly synchronized.

P.

>
> In the long term that would also allow cleaning up the container
> handling and simplifying the DRM scheduler a bit.
>
> Regards,
> Christian.