Re: [RFC PATCH] drm/sched: Fix teardown leaks with waitqueue

From: Philipp Stanner
Date: Thu Sep 05 2024 - 03:38:18 EST


On Wed, 2024-09-04 at 19:47 +0200, Simona Vetter wrote:
> On Tue, Sep 03, 2024 at 11:44:47AM +0200, Philipp Stanner wrote:
> > The GPU scheduler currently does not ensure that its pending_list is
> > empty before performing various other teardown tasks in
> > drm_sched_fini().
> >
> > If there are still jobs in the pending_list, this is problematic
> > because after scheduler teardown, no one will call
> > backend_ops.free_job() anymore. This would, consequently, result in
> > memory leaks.
> >
> > One way to solve this is to implement a waitqueue that
> > drm_sched_fini() blocks on until the pending_list has become empty.
> >
> > Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once the
> > pending_list becomes empty. Wait in drm_sched_fini() for that to
> > happen.
> >
> > Suggested-by: Danilo Krummrich <dakr@xxxxxxxxxx>
> > Signed-off-by: Philipp Stanner <pstanner@xxxxxxxxxx>
> > ---
> > Hi all,
> >
> > since the scheduler has many stakeholders, I want this solution
> > discussed as an RFC first.
> >
> > This version here has IMO the advantage (and disadvantage...) that it
> > blocks infinitely if the driver messed up the cleanup, so problems
> > might become more visible than with the refcount solution I proposed
> > in parallel.
>
> Very quick comment because I'm heading out for the r4l conference, but
> maybe I can discuss this there with Danilo a bit.
>
> Maybe we should do step 0 first, and document the current rules? The
> kerneldoc isn't absolute zero anymore, but it's very, very bare-bones.
> Then get that acked and merged, which is a very good way to make sure
> we're actually standing on common ground.

Yes, documentation is definitely also on my TODO list. I wanted to send
out something clarifying the objects' lifetimes (based on Christian's
previous series [1]) quite soon.

>
> Then maybe step 0.5 would be to add runtime asserts to enforce the
> rules, like if you tear down the scheduler and there's stuff in flight,
> you splat on a WARN_ON.
>
> With that foundation there should be a lot clearer basis to discuss
> whether there is an issue, and what a better design could look like.

I mean, I'm quite sure that there are teardown issues. But we could
indeed make them visible first through documentation (and a FIXME tag),
and, after establishing consensus by merging that, go on as you
suggested.

> From the little pondering I've done, I've come up with a few more ideas
> along similar lines.
>
> One comment below, kinda unrelated.
>
> >
> > Cheers,
> > P.
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 40 ++++++++++++++++++++++++++
> >  include/drm/gpu_scheduler.h            |  4 +++
> >  2 files changed, 44 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 7e90c9f95611..200fa932f289 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -564,6 +564,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >   * is parked at which point it's safe.
> >   */
> >   list_del_init(&job->list);
> > +
> > + /*
> > + * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > + */
> > + if (list_empty(&sched->pending_list))
> > + wake_up(&sched->job_list_waitque);
> > +
> >   spin_unlock(&sched->job_list_lock);
> >  
> >   status = job->sched->ops->timedout_job(job);
> > @@ -584,6 +591,15 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >   drm_sched_start_timeout_unlocked(sched);
> >  }
> >  
> > +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
> > +{
> > + /*
> > + * For list_empty() to work without a lock.
> > + */
>
> So this is pretty far from the gold standard for documenting memory
> barrier semantics in lockless code. Ideally we have a comment for both
> sides of the barrier (you always need two, or there's no functional
> barrier), pointing at each other, and explaining exactly what's being
> synchronized against what and how.
>
> I did years ago add a few missing barriers with that approach, see
> b0a5303d4e14 ("drm/sched: Barriers are needed for
> entity->last_scheduled"). Reading that patch again it feels a bit on
> the
> terse side of things (plus I noticed spelling issues now too, oops).

ACK, will do that properly once we roll out the actual patch.


P.

>
> Cheers, Sima
>
> > + rmb();
> > + return list_empty(&sched->pending_list);
> > +}
> > +
> >  /**
> >   * drm_sched_stop - stop the scheduler
> >   *
> > @@ -659,6 +675,12 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> >   }
> >   }
> >  
> > + /*
> > + * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > + */
> > + if (drm_sched_no_jobs_pending(sched))
> > + wake_up(&sched->job_list_waitque);
> > +
> >   /*
> >   * Stop pending timer in flight as we rearm it in drm_sched_start. This
> >   * avoids the pending timeout work in progress to fire right away after
> > @@ -1085,6 +1107,12 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
> >   /* remove job from pending_list */
> >   list_del_init(&job->list);
> >  
> > + /*
> > + * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > + */
> > + if (list_empty(&sched->pending_list))
> > + wake_up(&sched->job_list_waitque);
> > +
> >   /* cancel this job's TO timer */
> >   cancel_delayed_work(&sched->work_tdr);
> >   /* make the scheduled timestamp more accurate */
> > @@ -1303,6 +1331,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >   init_waitqueue_head(&sched->job_scheduled);
> >   INIT_LIST_HEAD(&sched->pending_list);
> >   spin_lock_init(&sched->job_list_lock);
> > + init_waitqueue_head(&sched->job_list_waitque);
> >   atomic_set(&sched->credit_count, 0);
> >   INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> >   INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > @@ -1333,12 +1362,23 @@ EXPORT_SYMBOL(drm_sched_init);
> >   * @sched: scheduler instance
> >   *
> >   * Tears down and cleans up the scheduler.
> > + *
> > + * Note that this function blocks until the fences returned by
> > + * backend_ops.run_job() have been signalled.
> >   */
> >  void drm_sched_fini(struct drm_gpu_scheduler *sched)
> >  {
> >   struct drm_sched_entity *s_entity;
> >   int i;
> >  
> > + /*
> > + * Jobs that have neither been scheduled nor timed out are gone
> > + * by now, but jobs that have been submitted through
> > + * backend_ops.run_job() and have not yet terminated are still pending.
> > + *
> > + * Wait for the pending_list to become empty to avoid leaking those jobs.
> > + */
> > + wait_event(sched->job_list_waitque, drm_sched_no_jobs_pending(sched));
> >   drm_sched_wqueue_stop(sched);
> >  
> >   for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 5acc64954a88..bff092784405 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -29,6 +29,7 @@
> >  #include <linux/completion.h>
> >  #include <linux/xarray.h>
> >  #include <linux/workqueue.h>
> > +#include <linux/wait.h>
> >  
> >  #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
> >  
> > @@ -503,6 +504,8 @@ struct drm_sched_backend_ops {
> >   *            timeout interval is over.
> >   * @pending_list: the list of jobs which are currently in the job queue.
> >   * @job_list_lock: lock to protect the pending_list.
> > + * @job_list_waitque: a waitqueue for drm_sched_fini() to block on until
> > + *       all pending jobs have been finished.
> >   * @hang_limit: once the hangs by a job crosses this limit then it is marked
> >   *              guilty and it will no longer be considered for scheduling.
> >   * @score: score to help loadbalancer pick a idle sched
> > @@ -532,6 +535,7 @@ struct drm_gpu_scheduler {
> >   struct delayed_work work_tdr;
> >   struct list_head pending_list;
> >   spinlock_t job_list_lock;
> > + wait_queue_head_t job_list_waitque;
> >   int hang_limit;
> >   atomic_t                        *score;
> >   atomic_t                        _score;
> > --
> > 2.46.0
> >
>