Re: [RFC PATCH v1] dma-fence-array: Deal with sub-fences that are signaled late

From: Jordan Crouse
Date: Mon Aug 17 2020 - 13:21:59 EST


On Thu, Aug 13, 2020 at 07:49:24AM +0100, Chris Wilson wrote:
> Quoting Jordan Crouse (2020-08-13 00:55:44)
> > This is an RFC because I'm still trying to grok the correct behavior.
> >
> > Consider a dma_fence_array created with two fences and signal_on_any is true.
> > A reference to dma_fence_array is taken for each waiting fence.
> >
> > When the client calls dma_fence_wait() only one of the fences is signaled.
> > The client returns successfully from the wait and puts its reference to
> > the array fence but the array fence still remains because of the remaining
> > un-signaled fence.
> >
> > Now consider that the unsignaled fence is signaled while the timeline is being
> > destroyed much later. The timeline destroy calls dma_fence_signal_locked(). The
> > following sequence occurs:
> >
> > 1) dma_fence_array_cb_func is called
> >
> > 2) array->num_pending is 0 (because it was set to 1 due to signal_on_any) so the
> > callback function calls dma_fence_put() instead of triggering the irq work
> >
> > 3) The array fence is released which in turn puts the lingering fence which is
> > then released
> >
> > 4) deadlock with the timeline
>
> It's the same recursive lock as we previously resolved in sw_sync.c by
> removing the locking from timeline_fence_release().

Ah, yep. I'm working on a not-quite-ready-for-primetime version of a Vulkan
timeline implementation for drm/msm and I was doing something similar to how
sw_sync used to work in the release function. Getting rid of the recursive lock
in the timeline seems a better solution than this. Thanks for taking the time
to respond.
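
For anyone following along, here is a rough userspace model of the recursion
described above. It is only a sketch and not kernel code: every name in it
(model_timeline, model_fence, release_takes_lock and so on) is made up for
illustration, the spinlock is replaced by a flag so that the recursive
acquisition is counted instead of hanging, and the whole chain of array
callback, array release and sub-fence put is collapsed into a single put.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the timeline; not the kernel structures. */
struct model_timeline {
	bool locked;     /* stands in for the timeline spinlock */
	int  deadlocks;  /* recursive lock attempts observed */
};

struct model_fence {
	struct model_timeline *tl;
	int  refcount;
	bool release_takes_lock;  /* true: release locks the timeline */
};

/* Returns false on a recursive acquire, where a real spinlock would spin. */
static bool model_lock(struct model_timeline *tl)
{
	if (tl->locked) {
		tl->deadlocks++;
		return false;
	}
	tl->locked = true;
	return true;
}

static void model_unlock(struct model_timeline *tl)
{
	tl->locked = false;
}

/* Release callback: optionally takes the timeline lock to unlink the fence. */
static void model_fence_release(struct model_fence *f)
{
	if (f->release_takes_lock) {
		if (!model_lock(f->tl))
			return;  /* the kernel would be stuck here */
		/* ... unlink the fence from the timeline here ... */
		model_unlock(f->tl);
	}
}

static void model_fence_put(struct model_fence *f)
{
	if (--f->refcount == 0)
		model_fence_release(f);
}

/*
 * Timeline destroy: signal the lingering fence with the lock held. The put
 * below stands in for array callback -> array release -> sub-fence put.
 */
static int model_timeline_destroy(struct model_fence *f)
{
	model_lock(f->tl);
	model_fence_put(f);
	model_unlock(f->tl);
	return f->tl->deadlocks;
}
```

With release_takes_lock set, destroying the timeline while it holds the last
reference records one recursive lock attempt; with it cleared (the approach
taken in the sw_sync fix) it records none.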

Jordan

> -Chris

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project