Re: [PATCH] drm/sched: Remove racy hack from drm_sched_fini()

From: Philipp Stanner

Date: Tue Feb 17 2026 - 05:27:57 EST


On Thu, 2026-01-08 at 09:30 +0100, Philipp Stanner wrote:
> drm_sched_fini() contained a hack to work around a race in amdgpu.
> According to AMD, the hack should no longer be necessary. In case
> there were still undetected users,
>
> commit 975ca62a014c ("drm/sched: Add warning for removing hack in drm_sched_fini()")
>
> had added a warning one release cycle ago.
>
> Thus, it can be derived that the hack can now be safely removed.
>
> Remove the hack.
>
> Signed-off-by: Philipp Stanner <phasta@xxxxxxxxxx>
> ---
> As hinted at in the commit message, I want to cozily queue this one
> up for the next merge window, since we have been printing that
> warning since the last merge window already.
>
> If someone has concerns I'm also happy to delay this patch for a few
> more releases.
> ---

Any objections from anyone?

Can I get an RB?


P.

>  drivers/gpu/drm/scheduler/sched_main.c | 38 +-------------------------
>  1 file changed, 1 insertion(+), 37 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 1d4f1b822e7b..381c1694a12e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1416,48 +1416,12 @@ static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
>   */
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
> - struct drm_sched_entity *s_entity;
>   int i;
>  
>   drm_sched_wqueue_stop(sched);
>  
> - for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> - struct drm_sched_rq *rq = sched->sched_rq[i];
> -
> - spin_lock(&rq->lock);
> - list_for_each_entry(s_entity, &rq->entities, list) {
> - /*
> - * Prevents reinsertion and marks job_queue as idle,
> - * it will be removed from the rq in drm_sched_entity_fini()
> - * eventually
> - *
> - * FIXME:
> - * This lacks the proper spin_lock(&s_entity->lock) and
> - * is, therefore, a race condition. Most notably, it
> - * can race with drm_sched_entity_push_job(). The lock
> - * cannot be taken here, however, because this would
> - * lead to lock inversion -> deadlock.
> - *
> - * The best solution probably is to enforce the life
> - * time rule of all entities having to be torn down
> - * before their scheduler. Then, however, locking could
> - * be dropped alltogether from this function.
> - *
> - * For now, this remains a potential race in all
> - * drivers that keep entities alive for longer than
> - * the scheduler.
> - *
> - * The READ_ONCE() is there to make the lockless read
> - * (warning about the lockless write below) slightly
> - * less broken...
> - */
> - if (!READ_ONCE(s_entity->stopped))
> - dev_warn(sched->dev, "Tearing down scheduler with active entities!\n");
> - s_entity->stopped = true;
> - }
> - spin_unlock(&rq->lock);
> + for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++)
>   kfree(sched->sched_rq[i]);
> - }
>  
>   /* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
>   wake_up_all(&sched->job_scheduled);