Re: [PATCH] drm/sched: Extend and update documentation

From: Philipp Stanner
Date: Thu Dec 05 2024 - 08:20:15 EST


On Thu, 2024-11-21 at 17:05 +0100, Alice Ryhl wrote:
> On Fri, Nov 15, 2024 at 11:36 AM Philipp Stanner
> <pstanner@xxxxxxxxxx> wrote:
> >
> > The various objects defined and used by the GPU scheduler are
> > currently
> > not fully documented. Furthermore, there is no documentation yet
> > informing drivers about how they should handle timeouts.
> >
> > Add documentation describing the scheduler's objects and timeout
> > procedure. Consistently, update
> > drm_sched_backend_ops.timedout_job()'s
> > documentation.
> >
> > Co-developed-by: Christian König <christian.koenig@xxxxxxx>
> > Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> > Signed-off-by: Philipp Stanner <pstanner@xxxxxxxxxx>
> > ---
> > I shamelessly stole- ahm, borrowed this documentation patch that
> > Christian had submitted a year ago:
> >
> > https://lore.kernel.org/dri-devel/20231116141547.206695-1-christian.koenig@xxxxxxx/
> >
> > I took feedback from last year into account where applicable, but
> > it's
> > probably a good idea if you all take a close look again.
> >
> > P.
> > ---
> >  Documentation/gpu/drm-mm.rst           |  36 +++++
> >  drivers/gpu/drm/scheduler/sched_main.c | 200
> > ++++++++++++++++++++++---
> >  include/drm/gpu_scheduler.h            |  16 +-
> >  3 files changed, 225 insertions(+), 27 deletions(-)
> >
> > diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-
> > mm.rst
> > index d55751cad67c..95ee95fd987a 100644
> > --- a/Documentation/gpu/drm-mm.rst
> > +++ b/Documentation/gpu/drm-mm.rst
> > @@ -556,12 +556,48 @@ Overview
> >  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> >     :doc: Overview
> >
> > +Job Object
> > +----------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > +   :doc: Job Object
> > +
> > +Entity Object
> > +-------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > +   :doc: Entity Object
> > +
> > +Hardware Fence Object
> > +---------------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > +   :doc: Hardware Fence Object
> > +
> > +Scheduler Fence Object
> > +----------------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > +   :doc: Scheduler Fence Object
> > +
> > +Scheduler and Run Queue Objects
> > +-------------------------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > +   :doc: Scheduler and Run Queue Objects
> > +
> >  Flow Control
> >  ------------
> >
> >  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> >     :doc: Flow Control
> >
> > +Error and Timeout handling
> > +--------------------------
> > +
> > +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > +   :doc: Error and Timeout handling
> > +
> >  Scheduler Function References
> >  -----------------------------
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index e97c6c60bc96..76eb46281985 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -24,28 +24,155 @@
> >  /**
> >   * DOC: Overview
> >   *
> > - * The GPU scheduler provides entities which allow userspace to
> > push jobs
> > - * into software queues which are then scheduled on a hardware run
> > queue.
> > - * The software queues have a priority among them. The scheduler
> > selects the entities
> > - * from the run queue using a FIFO. The scheduler provides
> > dependency handling
> > - * features among jobs. The driver is supposed to provide callback
> > functions for
> > - * backend operations to the scheduler like submitting a job to
> > hardware run queue,
> > - * returning the dependencies of a job etc.
> > + * The GPU scheduler is shared infrastructure intended to help
> > drivers managing
> > + * command submission to their hardware.
> >   *
> > - * The organisation of the scheduler is the following:
> > + * To do so, it offers a set of scheduling facilities that
> > interact with the
> > + * driver through callbacks which the latter can register.
> >   *
> > - * 1. Each hw run queue has one scheduler
> > - * 2. Each scheduler has multiple run queues with different
> > priorities
> > - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> > - * 3. Each scheduler run queue has a queue of entities to schedule
> > - * 4. Entities themselves maintain a queue of jobs that will be
> > scheduled on
> > - *    the hardware.
> > + * In particular, the scheduler takes care of:
> > + *   - Ordering command submissions
> > + *   - Signalling DMA fences, e.g., for finished commands
> > + *   - Taking dependencies between command submissions into
> > account
> > + *   - Handling timeouts for command submissions
>
> For the signalling case, you say "e.g.". Does that mean it also
> signals DMA fences in other cases?

Good question – it does signal another fence when a job is being
scheduled, but that's not really relevant for the scheduler user IMO.
On the other hand, the docu further down does refer to that fence.

I think we could mention them both here.

>
> > - * The jobs in a entity are always scheduled in the order that
> > they were pushed.
> > + * All callbacks the driver needs to implement are restricted by
> > DMA-fence
> > + * signaling rules to guarantee deadlock free forward progress.
> > This especially
> > + * means that for normal operation no memory can be allocated in a
> > callback.
> > + * All memory which is needed for pushing the job to the hardware
> > must be
> > + * allocated before arming a job. It also means that no locks can
> > be taken
> > + * under which memory might be allocated as well.
> >   *
> > - * Note that once a job was taken from the entities queue and
> > pushed to the
> > - * hardware, i.e. the pending queue, the entity must not be
> > referenced anymore
> > - * through the jobs entity pointer.
> > + * Memory which is optional to allocate, for example for device
> > core dumping or
> > + * debugging, *must* be allocated with GFP_NOWAIT and appropriate
> > error
> > + * handling if that allocation fails. GFP_ATOMIC should only be
> > used if
> > + * absolutely necessary since dipping into the special atomic
> > reserves is
> > + * usually not justified for a GPU driver.
> > + *
> > + * Note especially the following about the scheduler's historic
> > background that
> > + * lead to sort of a double role it plays today:
> > + *
> > + * In classic setups N entities share one scheduler, and the
> > scheduler decides
> > + * which job to pick from which entity and move it to the hardware
> > ring next
> > + * (that is: "scheduling").
> > + *
> > + * Many (especially newer) GPUs, however, can have an almost
> > arbitrary number
> > + * of hardware rings and it's a firmware scheduler which actually
> > decides which
> > + * job will run next. In such setups, the GPU scheduler is still
> > used (e.g., in
> > + * Nouveau) but does not "schedule" jobs in the classical sense
> > anymore. It
> > + * merely serves to queue and dequeue jobs and resolve
> > dependencies. In such a
> > + * scenario, it is recommended to have one scheduler per entity.
> > + */
> > +
> > +/**
> > + * DOC: Job Object
> > + *
> > + * The base job object (drm_sched_job) contains submission
> > dependencies in the
> > + * form of DMA-fence objects. Drivers can also implement an
> > optional
> > + * prepare_job callback which returns additional dependencies as
> > DMA-fence
> > + * objects. It's important to note that this callback can't
> > allocate memory or
> > + * grab locks under which memory is allocated.
> > + *
> > + * Drivers should use this as base class for an object which
> > contains the
> > + * necessary state to push the command submission to the hardware.
> > + *
> > + * The lifetime of the job object needs to last at least from
> > submitting it to
> > + * the scheduler (through drm_sched_job_arm()) until the scheduler
> > has invoked
> > + * drm_sched_backend_ops.free_job() and, thereby, has indicated
> > that it does
> > + * not need the job anymore. Drivers can of course keep their job
> > object alive
> > + * for longer than that, but that's outside of the scope of the
> > scheduler
> > + * component.
> > + *
> > + * Job initialization is split into two stages:
> > + *   1. drm_sched_job_init() which serves for basic preparation of
> > a job.
> > + *      Drivers don't have to be mindful of this function's
> > consequences and
> > + *      its effects can be reverted through
> > drm_sched_job_cleanup().
> > + *   2. drm_sched_job_arm() which irrevokably arms a job for
> > execution. This
> > + *      activates the job's fence, i.e., it registers the
> > callbacks. Thus,
> > + *      inevitably, the callbacks will access the job and its
> > memory at some
> > + *      point in the future. This means that once
> > drm_sched_job_arm() has been
> > + *      called, the job structure has to be valid until the
> > scheduler invoked
> > + *      drm_sched_backend_ops.free_job().
>
> This is written as-if there could be multiple callbacks in a single
> job. Is that the case?
>

No, arm() just arms the software fence that will fire after the
scheduler is done with the job and calls free_job().

Will adjust.

>
>
>
> Also typo: "invoked" -> "invokes".

Past tense actually was done on purpose, but I think present tense is
better indeed, because the validity ends once the function starts
executing.

>
> > + * It's important to note that after arming a job drivers must
> > follow the
> > + * DMA-fence rules and can't easily allocate memory or takes locks
> > under which
> > + * memory is allocated.
>
> comma? "job, drivers"
> typo: "or takes" -> "or take"

ACK

>
> > +
> > +/**
> > + * DOC: Entity Object
> > + *
> > + * The entity object (drm_sched_entity) which is a container for
> > jobs which
> > + * should execute sequentially. Drivers should create an entity
> > for each
> > + * individual context they maintain for command submissions which
> > can run in
> > + * parallel.
>
> This is a bit awkward, how about: "The entity object is a container
> for jobs that should execute sequentially."

ACK

>
> > + * The lifetime of the entity *should not* exceed the lifetime of
> > the
> > + * userspace process it was created for and drivers should call
> > the
> > + * drm_sched_entity_flush() function from their
> > file_operations.flush()
> > + * callback. It is possible that an entity object is not alive
> > anymore
> > + * while jobs previously fetched from it are still running on the
> > hardware.
>
> To be clear ... this is about not letting processes run code after
> dying, and not because something you're using gets freed after
> flush(), correct?

Not sure if I can fully follow, but this section is not really about
preventing processes from anything. It's just that an entity is created
as a job-container for a specific process, so it doesn't make sense
that the entity lives longer than the process.

If the driver wouldn't ensure that it would simply mean that the
scheduler keeps pulling jobs from that entity, but since the userspace
process is gone by then it wouldn't really matter. I'm not aware of
this really being harmful, as long as the entity gets cleaned up at
some point.

>
> > + * This is done because all results of a command submission should
> > become
> > + * visible externally even after a process exits. This is normal
> > POSIX
> > + * behavior for I/O operations.
> > + *
> > + * The problem with this approach is that GPU submissions contain
> > executable
> > + * shaders enabling processes to evade their termination by
> > offloading work to
> > + * the GPU. So when a process is terminated with a SIGKILL the
> > entity object
> > + * makes sure that jobs are freed without running them while still
> > maintaining
> > + * correct sequential order for signaling fences.
> > + */
> > +
> > +/**
> > + * DOC: Hardware Fence Object
> > + *
> > + * The hardware fence object is a DMA-fence provided by the driver
> > as result of
> > + * running jobs. Drivers need to make sure that the normal DMA-
> > fence semantics
> > + * are followed for this object. It's important to note that the
> > memory for
> > + * this object can *not* be allocated in
> > drm_sched_backend_ops.run_job() since
> > + * that would violate the requirements for the DMA-fence
> > implementation. The
> > + * scheduler maintains a timeout handler which triggers if this
> > fence doesn't
> > + * signal within a configurable amount of time.
> > + *
> > + * The lifetime of this object follows DMA-fence refcounting
> > rules. The
> > + * scheduler takes ownership of the reference returned by the
> > driver and
> > + * drops it when it's not needed any more.
> > + */
> > +
> > +/**
> > + * DOC: Scheduler Fence Object
> > + *
> > + * The scheduler fence object (drm_sched_fence) which encapsulates
> > the whole
> > + * time from pushing the job into the scheduler until the hardware
> > has finished
> > + * processing it. This is internally managed by the scheduler, but
> > drivers can
> > + * grab additional reference to it after arming a job. The
> > implementation
> > + * provides DMA-fence interfaces for signaling both scheduling of
> > a command
> > + * submission as well as finishing of processing.
>
> typo: "an additional reference" or "additional references"
>
> > + * The lifetime of this object also follows normal DMA-fence
> > refcounting rules.
> > + * The finished fence is the one normally exposed to the outside
> > world, but the
> > + * driver can grab references to both the scheduled as well as the
> > finished
> > + * fence when needed for pipelining optimizations.
>
> When you refer to the "scheduled fence" and the "finished fence",
> these are referring to "a fence indicating when the job was scheduled
> / finished", rather than "a fence which was scheduled for execution
> and has now become finished", correct? I think the wording could be a
> bit clearer here.

If anything it would have to be "A fence signaling once a job has been
scheduled".

I'll try to come up with a better wording

Thx,
P.


>
> Alice
>