Re: [PATCH v7 3/3] drm/sched: Update timedout_job()'s documentation

From: Matthew Brost
Date: Fri Mar 07 2025 - 12:35:29 EST


On Fri, Mar 07, 2025 at 10:37:04AM +0100, Philipp Stanner wrote:
> On Thu, 2025-03-06 at 12:57 -0800, Matthew Brost wrote:
> > On Wed, Mar 05, 2025 at 02:05:52PM +0100, Philipp Stanner wrote:
> > > drm_sched_backend_ops.timedout_job()'s documentation is outdated. It
> > > mentions the deprecated function drm_sched_resubmit_jobs().
> > > Furthermore, it does not point out the important distinction between
> > > hardware and firmware schedulers.
> > >
> > > Since firmware schedulers typically only use one entity per scheduler,
> > > timeout handling is significantly simpler, because the entity the
> > > faulted job came from can just be killed without affecting innocent
> > > processes.
> > >
> > > Update the documentation with that distinction and other details.
> > >
> > > Reformat the docstring to conform to a unified style with the other
> > > handlers.
> > >
> >
> > Looks really good, one suggestion.
>
> Already merged. But I'm already working on the TODO and could address
> your feedback in that followup.
>
> Of course, it would also be great if you could provide a proposal in a
> patch? :)
>
> >
> > > Signed-off-by: Philipp Stanner <phasta@xxxxxxxxxx>
> > > ---
> > >  include/drm/gpu_scheduler.h | 78 ++++++++++++++++++++++---------------
> > >  1 file changed, 47 insertions(+), 31 deletions(-)
> > >
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index 6381baae8024..1a7e377d4cbb 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -383,8 +383,15 @@ struct drm_sched_job {
> > >   struct xarray dependencies;
> > >  };
> > >  
> > > +/**
> > > + * enum drm_gpu_sched_stat - the scheduler's status
> > > + *
> > > + * @DRM_GPU_SCHED_STAT_NONE: Reserved. Do not use.
> > > + * @DRM_GPU_SCHED_STAT_NOMINAL: Operation succeeded.
> > > + * @DRM_GPU_SCHED_STAT_ENODEV: Error: Device is not available anymore.
> > > + */
> > >  enum drm_gpu_sched_stat {
> > > - DRM_GPU_SCHED_STAT_NONE, /* Reserve 0 */
> > > + DRM_GPU_SCHED_STAT_NONE,
> > >   DRM_GPU_SCHED_STAT_NOMINAL,
> > >   DRM_GPU_SCHED_STAT_ENODEV,
> > >  };
> > > @@ -447,43 +454,52 @@ struct drm_sched_backend_ops {
> > >   * @timedout_job: Called when a job has taken too long to execute,
> > >   * to trigger GPU recovery.
> > >   *
> > > - * This method is called in a workqueue context.
> > > + * @sched_job: The job that has timed out
> > >   *
> > > - * Drivers typically issue a reset to recover from GPU hangs, and this
> > > - * procedure usually follows the following workflow:
> > > + * Drivers typically issue a reset to recover from GPU hangs.
> > > + * This procedure looks very different depending on whether a firmware
> > > + * or a hardware scheduler is being used.
> > >   *
> > > - * 1. Stop the scheduler using drm_sched_stop(). This will park the
> > > - *    scheduler thread and cancel the timeout work, guaranteeing that
> > > - *    nothing is queued while we reset the hardware queue
> > > - * 2. Try to gracefully stop non-faulty jobs (optional)
> > > - * 3. Issue a GPU reset (driver-specific)
> > > - * 4. Re-submit jobs using drm_sched_resubmit_jobs()
> > > - * 5. Restart the scheduler using drm_sched_start(). At that point, new
> > > - *    jobs can be queued, and the scheduler thread is unblocked
> > > + * For a FIRMWARE SCHEDULER, each ring has one scheduler, and each
> > > + * scheduler has one entity. Hence, the steps taken typically look as
> > > + * follows:
> > > + *
> > > + * 1. Stop the scheduler using drm_sched_stop(). This will pause the
> > > + *    scheduler workqueues and cancel the timeout work, guaranteeing
> > > + *    that nothing is queued while the ring is being removed.
> > > + * 2. Remove the ring. The firmware will make sure that the
> > > + *    corresponding parts of the hardware are reset, and that other
> > > + *    rings are not impacted.
> > > + * 3. Kill the entity and the associated scheduler.
> >
> > Xe doesn't do step 3.
> >
> > It does:
> > - Ban the entity / scheduler so future submissions are a NOP. This
> >   would be submissions with unmet dependencies. Submissions at the
> >   IOCTL level are disallowed.
> > - Signal all jobs' fences on the pending list.
> > - Restart the scheduler so free_job() is naturally called.
> >
> > I'm unsure if this is how other firmware schedulers do it, but it
> > seems to work quite well in Xe.

Missed this part of the reply.

>
> Alright, so if I interpret this correctly you do that to avoid our
> infamous memory leaks. That makes sense.
>

Yes.
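
For the doc followup, the rough shape of that in a timedout_job()
handler would be something like the below. Purely a sketch: the xyz_*
helpers are made up, locking is elided, and the errno parameter to
drm_sched_start() assumes a recent kernel.

static enum drm_gpu_sched_stat
xyz_timedout_job(struct drm_sched_job *sched_job)
{
        struct drm_gpu_scheduler *sched = sched_job->sched;

        /* Park submission and cancel the timeout work. */
        drm_sched_stop(sched, sched_job);

        /* Ban the queue so later submissions become a NOP. */
        xyz_queue_ban(sched);

        /* Tell the firmware to tear down only this ring. */
        xyz_ring_remove(sched);

        /*
         * Signal the fences of everything still on the pending list
         * (with an error) so all jobs flow through free_job().
         */
        xyz_signal_pending_fences(sched, -ECANCELED);

        /* Restart so the scheduler itself calls free_job(). */
        drm_sched_start(sched, 0);

        return DRM_GPU_SCHED_STAT_NOMINAL;
}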

> The memory leaks are documented in drm_sched_fini()'s docu, but it
> could make sense to mention them here, too.
>

The jobs in Xe refcount the scheduler, so we never call drm_sched_fini()
until all jobs in the pending list and dependency queues have called
free_job().
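
In other words, roughly this (hypothetical xyz_* names; the job takes
its reference with kref_get() in the driver's job-create path, and the
real Xe code is more involved):

struct xyz_sched {
        struct kref refcount;
        struct drm_gpu_scheduler base;
};

static void xyz_sched_release(struct kref *ref)
{
        struct xyz_sched *s = container_of(ref, struct xyz_sched, refcount);

        /* The last job is gone; only now is teardown safe. */
        drm_sched_fini(&s->base);
        kfree(s);
}

static void xyz_free_job(struct drm_sched_job *sched_job)
{
        struct xyz_sched *s = container_of(sched_job->sched,
                                           struct xyz_sched, base);

        drm_sched_job_cleanup(sched_job);
        kfree(sched_job);
        /* Drop the reference taken when the job was created. */
        kref_put(&s->refcount, xyz_sched_release);
}

That way the pending list can never outlive the scheduler.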

> … thinking about it, we probably actually have to rephrase this line.
> Just tearing down entity & sched makes those leaks very likely. Argh.
>
> Nouveau, also a firmware scheduler, has effectively a copy of the
> pending_list and also ensures that all fences get signalled. Only once
> that copy of the pending list is empty does it call into drm_sched_fini().
> Take a look at nouveau_sched.c if you want, the code is quite
> straightforward.
>

Same idea in Xe, I think; we just directly access the pending list.
Let me look at what Nouveau is doing before posting an updated doc
patch here.
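
From your description I'd expect the teardown side to look roughly like
this, extending the xyz_sched sketch from above with a driver-side
mirror of the pending list (again hypothetical names, locking elided):

struct xyz_sched {
        struct drm_gpu_scheduler base;
        struct kref refcount;
        struct list_head job_list;              /* mirrors pending jobs */
        wait_queue_head_t job_list_wq;          /* woken from free_job() */
};

static void xyz_sched_fini(struct xyz_sched *s)
{
        /*
         * All fences have been signalled, so every job will pass
         * through free_job() and drop off the mirror list. Only once
         * it is empty is drm_sched_fini() safe.
         */
        wait_event(s->job_list_wq, list_empty(&s->job_list));
        drm_sched_fini(&s->base);
}

If that matches Nouveau, the doc can describe the pattern generically.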

Matt

> P.
>
> >
> > Matt
> >
> > > + *
> > > + *
> > > + * For a HARDWARE SCHEDULER, a scheduler instance schedules jobs from
> > > + * one or more entities to one ring. This implies that all entities
> > > + * associated with the affected scheduler cannot be torn down, because
> > > + * this would effectively also affect innocent userspace processes which
> > > + * did not submit faulty jobs (for example).
> > > + *
> > > + * Consequently, the procedure to recover with a hardware scheduler
> > > + * should look like this:
> > > + *
> > > + * 1. Stop all schedulers impacted by the reset using drm_sched_stop().
> > > + * 2. Kill the entity the faulty job stems from.
> > > + * 3. Issue a GPU reset on all faulty rings (driver-specific).
> > > + * 4. Re-submit jobs on all schedulers impacted by re-submitting them to
> > > + *    the entities which are still alive.
> > > + * 5. Restart all schedulers that were stopped in step #1 using
> > > + *    drm_sched_start().
> > >   *
> > >   * Note that some GPUs have distinct hardware queues but need to reset
> > >   * the GPU globally, which requires extra synchronization between the
> > > - * timeout handler of the different &drm_gpu_scheduler. One way to
> > > - * achieve this synchronization is to create an ordered workqueue
> > > - * (using alloc_ordered_workqueue()) at the driver level, and pass this
> > > - * queue to drm_sched_init(), to guarantee that timeout handlers are
> > > - * executed sequentially. The above workflow needs to be slightly
> > > - * adjusted in that case:
> > > + * timeout handlers of different schedulers. One way to achieve this
> > > + * synchronization is to create an ordered workqueue (using
> > > + * alloc_ordered_workqueue()) at the driver level, and pass this queue
> > > + * as drm_sched_init()'s @timeout_wq parameter. This will guarantee
> > > + * that timeout handlers are executed sequentially.
> > >   *
> > > - * 1. Stop all schedulers impacted by the reset using drm_sched_stop()
> > > - * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
> > > - *    the reset (optional)
> > > - * 3. Issue a GPU reset on all faulty queues (driver-specific)
> > > - * 4. Re-submit jobs on all schedulers impacted by the reset using
> > > - *    drm_sched_resubmit_jobs()
> > > - * 5. Restart all schedulers that were stopped in step #1 using
> > > - *    drm_sched_start()
> > > + * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> > >   *
> > > - * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
> > > - * and the underlying driver has started or completed recovery.
> > > - *
> > > - * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> > > - * available, i.e. has been unplugged.
> > >   */
> > >   enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
> > >  
> > > --
> > > 2.48.1
> > >
>