Re: [PATCH v7 3/3] drm/sched: Update timedout_job()'s documentation
From: Philipp Stanner
Date: Fri Mar 07 2025 - 04:37:20 EST
On Thu, 2025-03-06 at 12:57 -0800, Matthew Brost wrote:
> On Wed, Mar 05, 2025 at 02:05:52PM +0100, Philipp Stanner wrote:
> > drm_sched_backend_ops.timedout_job()'s documentation is outdated. It
> > mentions the deprecated function drm_sched_resubmit_jobs(). Furthermore,
> > it does not point out the important distinction between hardware and
> > firmware schedulers.
> >
> > Since firmware schedulers typically only use one entity per scheduler,
> > timeout handling is significantly simpler, because the entity the
> > faulted job came from can just be killed without affecting innocent
> > processes.
> >
> > Update the documentation with that distinction and other details.
> >
> > Reformat the docstring to follow a unified style with the other handles.
> >
>
> Looks really good, one suggestion.
Already merged. But I'm already working on the TODO and could address
your feedback in that follow-up.

Of course, it would also be great if you could provide a proposal in a
patch? :)
>
> > Signed-off-by: Philipp Stanner <phasta@xxxxxxxxxx>
> > ---
> > include/drm/gpu_scheduler.h | 78 ++++++++++++++++++++++---------------
> > 1 file changed, 47 insertions(+), 31 deletions(-)
> >
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 6381baae8024..1a7e377d4cbb 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -383,8 +383,15 @@ struct drm_sched_job {
> > struct xarray dependencies;
> > };
> >
> > +/**
> > + * enum drm_gpu_sched_stat - the scheduler's status
> > + *
> > + * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> > + * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> > + * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> > + */
> > enum drm_gpu_sched_stat {
> > - DRM_GPU_SCHED_STAT_NONE, /* Reserve 0 */
> > + DRM_GPU_SCHED_STAT_NONE,
> > DRM_GPU_SCHED_STAT_NOMINAL,
> > DRM_GPU_SCHED_STAT_ENODEV,
> > };
> > @@ -447,43 +454,52 @@ struct drm_sched_backend_ops {
> > * @timedout_job: Called when a job has taken too long to execute,
> > * to trigger GPU recovery.
> > *
> > - * This method is called in a workqueue context.
> > + * @sched_job: The job that has timed out
> > *
> > - * Drivers typically issue a reset to recover from GPU hangs, and this
> > - * procedure usually follows the following workflow:
> > + * Drivers typically issue a reset to recover from GPU hangs.
> > + * This procedure looks very different depending on whether a firmware
> > + * or a hardware scheduler is being used.
> > *
> > - * 1. Stop the scheduler using drm_sched_stop(). This will park the
> > - * scheduler thread and cancel the timeout work, guaranteeing that
> > - * nothing is queued while we reset the hardware queue
> > - * 2. Try to gracefully stop non-faulty jobs (optional)
> > - * 3. Issue a GPU reset (driver-specific)
> > - * 4. Re-submit jobs using drm_sched_resubmit_jobs()
> > - * 5. Restart the scheduler using drm_sched_start(). At that point, new
> > - * jobs can be queued, and the scheduler thread is unblocked
> > + * For a FIRMWARE SCHEDULER, each ring has one scheduler, and each
> > + * scheduler has one entity. Hence, the steps taken typically look as
> > + * follows:
> > + *
> > + * 1. Stop the scheduler using drm_sched_stop(). This will pause the
> > + * scheduler workqueues and cancel the timeout work, guaranteeing
> > + * that nothing is queued while the ring is being removed.
> > + * 2. Remove the ring. The firmware will make sure that the
> > + * corresponding parts of the hardware are reset, and that other
> > + * rings are not impacted.
> > + * 3. Kill the entity and the associated scheduler.
>
> Xe doesn't do step 3.
>
> It does:
> - Ban entity / scheduler so future submissions are a NOP. This would be
>   submissions with unmet dependencies. Submissions at the IOCTL level are
>   disallowed.
> - Signal all job's fences on the pending list
> - Restart scheduler so free_job() is naturally called
>
> I'm unsure if this is how other firmware schedulers do this, but it
> seems to work quite well in Xe.
Alright, so if I interpret this correctly, you do that to avoid our
infamous memory leaks. That makes sense.
The memory leaks are documented in drm_sched_fini()'s documentation, but
it could make sense to mention them here, too.
… thinking about it, we probably actually have to rephrase this line.
Just tearing down entity & sched makes those leaks very likely. Argh.
Nouveau, also a firmware scheduler, has effectively a copy of the
pending_list and also ensures that all fences get signalled. Only once
that copy of the pending list is empty does it call into drm_sched_fini().
Take a look at nouveau_sched.c if you want, the code is quite
straightforward.
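Roughly, the pattern looks like the sketch below (simplified, with made-up
"foo" names; only drm_sched_fini(), dma_fence_signal() and the list /
waitqueue helpers are real API):

struct foo_sched {
        struct drm_gpu_scheduler base;
        spinlock_t job_lock;
        struct list_head job_list;      /* driver's own copy of the pending jobs */
        wait_queue_head_t job_wq;       /* woken up from free_job() */
};

struct foo_job {
        struct drm_sched_job base;
        struct list_head head;          /* entry in foo_sched.job_list */
        struct dma_fence *hw_fence;     /* fence returned by run_job() */
};

static void foo_sched_fini(struct foo_sched *fsched)
{
        struct foo_job *job;

        /* Signal the fences of all still-pending jobs so that the
         * scheduler will eventually call free_job() for each of them. */
        spin_lock(&fsched->job_lock);
        list_for_each_entry(job, &fsched->job_list, head)
                dma_fence_signal(job->hw_fence);
        spin_unlock(&fsched->job_lock);

        /* Tear the scheduler down only once the driver's own list is
         * empty, i.e. free_job() has run for everything. That is what
         * avoids the leaks documented at drm_sched_fini(). */
        wait_event(fsched->job_wq, list_empty(&fsched->job_list));

        drm_sched_fini(&fsched->base);
}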
P.
>
> Matt
>
> > + *
> > + *
> > + * For a HARDWARE SCHEDULER, a scheduler instance schedules jobs from
> > + * one or more entities to one ring. This implies that all entities
> > + * associated with the affected scheduler cannot be torn down, because
> > + * this would effectively also affect innocent userspace processes which
> > + * did not submit faulty jobs (for example).
> > + *
> > + * Consequently, the procedure to recover with a hardware scheduler
> > + * should look like this:
> > + *
> > + * 1. Stop all schedulers impacted by the reset using drm_sched_stop().
> > + * 2. Kill the entity the faulty job stems from.
> > + * 3. Issue a GPU reset on all faulty rings (driver-specific).
> > + * 4. Re-submit jobs on all impacted schedulers by re-submitting them to
> > + * the entities which are still alive.
> > + * 5. Restart all schedulers that were stopped in step #1 using
> > + * drm_sched_start().
> > *
> > * Note that some GPUs have distinct hardware queues but need to reset
> > * the GPU globally, which requires extra synchronization between the
> > - * timeout handler of the different &drm_gpu_scheduler. One way to
> > - * achieve this synchronization is to create an ordered workqueue
> > - * (using alloc_ordered_workqueue()) at the driver level, and pass this
> > - * queue to drm_sched_init(), to guarantee that timeout handlers are
> > - * executed sequentially. The above workflow needs to be slightly
> > - * adjusted in that case:
> > + * timeout handlers of different schedulers. One way to achieve this
> > + * synchronization is to create an ordered workqueue (using
> > + * alloc_ordered_workqueue()) at the driver level, and pass this queue
> > + * as drm_sched_init()'s @timeout_wq parameter. This will guarantee
> > + * that timeout handlers are executed sequentially.
> > *
> > - * 1. Stop all schedulers impacted by the reset using drm_sched_stop()
> > - * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
> > - * the reset (optional)
> > - * 3. Issue a GPU reset on all faulty queues (driver-specific)
> > - * 4. Re-submit jobs on all schedulers impacted by the reset using
> > - * drm_sched_resubmit_jobs()
> > - * 5. Restart all schedulers that were stopped in step #1 using
> > - * drm_sched_start()
> > + * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> > *
> > - * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
> > - * and the underlying driver has started or completed recovery.
> > - *
> > - * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> > - * available, i.e. has been unplugged.
> > */
> > enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
> >
> > --
> > 2.48.1
> >
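
PS: For the firmware scheduler case documented above, a timedout_job()
implementation could look roughly like the sketch below. Everything
prefixed "foo" is made up for illustration; only drm_sched_stop() and the
drm_gpu_sched_stat values are real API:

struct foo_queue {
        struct drm_gpu_scheduler sched;         /* one scheduler per firmware ring */
        struct foo_device *dev;
};

static enum drm_gpu_sched_stat foo_timedout_job(struct drm_sched_job *sched_job)
{
        struct foo_queue *q = container_of(sched_job->sched, struct foo_queue, sched);

        if (foo_device_unplugged(q->dev))
                return DRM_GPU_SCHED_STAT_ENODEV;

        /* 1. Park the scheduler so nothing new is queued to the ring. */
        drm_sched_stop(sched_job->sched, sched_job);

        /* 2. Have the firmware remove / reset only this ring. */
        foo_fw_remove_ring(q);

        /* 3. Ban the queue and make sure the fences of all still-pending
         *    jobs get signalled before entity and scheduler are torn down,
         *    as discussed further up in the thread. */
        foo_queue_kill(q);

        return DRM_GPU_SCHED_STAT_NOMINAL;
}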