Re: [PATCH 3/3] drm/scheduler: Clean up jobs when the scheduler is torn down.

From: Luben Tuikov
Date: Wed Aug 02 2023 - 10:13:15 EST


On 2023-08-02 00:06, Matthew Brost wrote:
> On Mon, Jul 17, 2023 at 01:40:38PM -0400, Luben Tuikov wrote:
>> On 2023-07-16 03:51, Asahi Lina wrote:
>>> On 15/07/2023 16.14, Luben Tuikov wrote:
>>>> On 2023-07-14 04:21, Asahi Lina wrote:
>>>>> drm_sched_fini() currently leaves any pending jobs dangling, which
>>>>> causes segfaults and other badness when job completion fences are
>>>>> signaled after the scheduler is torn down.
>>>>
>>>> If there are pending jobs, ideally we want to call into the driver,
>>>> so that it can release resources it may be holding for those.
>>>> The idea behind "pending" is that they are pending in the hardware
>>>> and we don't know their state until signalled/the callback called.
>>>> (Or unless the device is reset and we get a notification of that fact.)
>>>
>>> That's what the job->free_job() callback does, then the driver is free
>>> to do whatever it wants with those jobs. A driver could opt to
>>> synchronously kill those jobs (if it can) or account for them
>>> separately/asynchronously.
>>>
>>> What this patch basically says is that if you destroy a scheduler with
>>> pending jobs, it immediately considers them terminated with an error,
>>> and returns ownership back to the driver for freeing. Then the driver
>>> can decide how to handle the rest and whatever the underlying hardware
>>> state is.
>>>
>>>>> Explicitly detach all jobs from their completion callbacks and free
>>>>> them. This makes it possible to write a sensible safe abstraction for
>>>>> drm_sched, without having to externally duplicate the tracking of
>>>>> in-flight jobs.
>>>>>
>>>>> This shouldn't regress any existing drivers, since calling
>>>>> drm_sched_fini() with any pending jobs is broken and this change should
>>>>> be a no-op if there are no pending jobs.
>>>>
>>>> While this statement is true on its own, it kind of contradicts
>>>> the premise of the first paragraph.
>>>
>>> I mean right *now* it's broken, before this patch. I'm trying to make it
>>> safe, but it shouldn't regress any exiting drivers since if they trigger
>>> this code path they are broken today.
>>
>> Not sure about other drivers--they can speak for themselves and the CC list
>> should include them--please use "dim add-missing-cc" and make sure
>> that the Git commit description contains the Cc tags--then git send-email
>> will populate the SMTP CC. Feel free to add more Cc tags on top of that.
>>
>
> Xe doesn't need this as our reference counting scheme doesn't allow
> drm_sched_fini to be called when jobs are pending. If we want to
> teardown a drm_sched we set TDR timeout to zero and all pending jobs
> gets cleaned up that way, the ref of sched will go to zero, and
> drm_sched_fini is called. The caveat here being I think we need a worker
> to call drm_sched_fini as the last ref to scheduler might be dropped
> from within scheduler main thread.
>
> That being said, I doubt this patch breaks anything in Xe so do not a
> real strong opinion on this.

Yes, that's my understanding as well. If the drivers call drm_sched_fini()
then they are responsible for cleaning up _before_ calling drm_sched_fini().
The Xe driver seems to be doing the right thing. All in all, since
drm_sched_fini() is called by the driver, the driver is supposed to have
cleaned up before calling it, so I don't see much need for this patch
after all.
--
Regards,
Luben