Re: [PATCH] drm/drm_crtc: fix race with dma_fence_signal() in ::get_driver_name()

From: Philipp Stanner

Date: Tue Jun 23 2026 - 07:59:19 EST


On Tue, 2026-06-23 at 12:37 +0100, André Draszik wrote:
> Hi,
>
> On Thu, 2026-06-18 at 17:56 +0200, Philipp Stanner wrote:
> > +Cc Danilo
> >
> > On Thu, 2026-06-18 at 15:03 +0100, André Draszik wrote:
> > > Since commit 541c8f2468b9 ("dma-buf: detach fence ops on signal v3"),
> > > I'm seeing the BUG_ON() triggering in drm_crtc's fence_to_crtc() via
> > > drm_crtc_fence_get_driver_name() regularly:
> > >
> > >     Call trace:
> > >      panic+0x58/0x5c
> > >      die+0x160/0x178
> > >      bug_brk_handler+0x70/0xa4
> > >      call_el1_break_hook+0x3c/0x1a0
> > >      do_el1_brk64+0x24/0x74
> > >      el1_brk64+0x34/0x54
> > >      el1h_64_sync_handler+0x80/0xfc
> > >      el1h_64_sync+0x84/0x88
> > >      drm_crtc_fence_get_driver_name+0x60/0x68 (P)
> > >      sync_file_get_name+0x184/0x45c
> > >      sync_file_ioctl+0x404/0xf70
> > >      __arm64_sys_ioctl+0x124/0x1dc
> > >
> > > This looks to be caused by a code flow similar to the following:
> > >
> > > +++ snip +++
> > > thread A                             thread B
> > >
> > >                                      ioctl(SYNC_IOC_FILE_INFO)
> > >                                      sync_file_ioctl()
> > >                                      sync_file_get_name()
> > > dma_fence_signal_timestamp_locked()  dma_fence_driver_name()
> > >                                        ops = rcu_dereference(fence->ops)
> > >                                        if (!dma_fence_test_signaled_flag())
> > >                                          ops->get_driver_name(fence) i.e.
> > >                                          drm_crtc_fence_get_driver_name()
> > > test_and_set_bit(SIGNALED)
> > > RCU_INIT_POINTER(fence->ops, NULL)
> > >                                      drm_crtc_fence_get_driver_name()
> > >                                        BUG_ON(rcu_access_pointer(fence->ops)
> > >                                               != &drm_crtc_fence_ops)
> >
> > Now this looks like a very similar problem that I have recently been
> > concerned with:
> >
> > https://lore.kernel.org/dri-devel/20260612104251.2264707-2-phasta@xxxxxxxxxx/
> >
> > https://lore.kernel.org/dri-devel/fa0dc9757bf8343516c4b156a2b70ec91b64ef8f.camel@xxxxxxxxxxx/
> >
> >
> > I continue to believe because of bugs like this and the ones I have
> > quoted in the threads above the robustness of the kernel could be
> > greatly improved if we could get dma_fence fully synchronized with its
> > lock.
>
> On top of that, sashiko highlighted  (via my other patch) that the existing
> code is missing some memory barriers:
>
> https://sashiko.dev/#/patchset/20260618-linux-drm_crtc_fix-v1-1-801f29c9853d@xxxxxxxxxx?part=1
>
> I believe Lock synchronization would resolve that (as would adding explicit
> memory barriers).

That is being discussed in the thread I linked, where Gary lists which
barriers you would need for (presumably correct) lockless magic.

However, if my issue were to be solved with barriers, the
test_and_set_bit() in dma_fence_signal_timestamp_locked() would have to
be replaced with the more weakly ordered test_bit() and set_bit(),
maybe creating other pitfalls.

The ordering issue in the get_*_name() functions plays into that.
Setting the bit would then be done after setting the ops-pointer to
NULL. So one would have to try to move the NULL set, too.

Long story short, this is painful and subtle.

But I think what we are realizing over and over again is that dma_fence
has many subtleties to its API contract, and the implementation's
sparring use of spinlocks leads to workarounds where people take locks
manually or have to do an RCU dance.

Note that Christian is strongly opposed to guarding everything with
locks, in part for supposedly occuring deadlocks in the fence callbacks
when the driver needs to take its own locks.

The community discussion regarding that problem is currently in some
sort of dead end, where none of us seems to know what the correct path
forward is.

drm_sched users (and future users in Rust) use intermediate fences
which decouple e.g. userspace from the actual hardware fences. So one
path forward might be to question the callbacks in general and think
about some sort of replacement for them.

>
> [...]
> > >
>
> > Does the CRTC or DRM device need to be kept alive for the RCU grace period,
> > or should the fence hold a proper reference to prevent the use-after-free
> > when get_driver_name() and get_timeline_name() access the freed CRTC
> > structure?
>
> Do you guys have any preference on that? It appears the use-after-free
> should be resolved before merging the removal of the BUG_ON(), and I'd like
> to progress on this.

My understanding of the current situation is that as an issuer of
dma_fence's you, in general, should wait for a grace period until you
perform operations like driver unload, or, more generally, have fence-
related resources and such being accessed through callbacks go away.

Danilo has recently mentioned some life-time inconsistencies between
wider kernel device model and DRM device model that might be related to
that discussion, and which made him object against some RCU
requirements.

Maybe he's got the time to share some details with you that are
relevant to your work.


P.

>
> Cheers,
> Andre'