Re: [RFC PATCH] drm/sched: Fix teardown leaks with waitqueue
From: Philipp Stanner
Date: Thu Sep 05 2024 - 03:38:18 EST
On Wed, 2024-09-04 at 19:47 +0200, Simona Vetter wrote:
> On Tue, Sep 03, 2024 at 11:44:47AM +0200, Philipp Stanner wrote:
> > The GPU scheduler currently does not ensure that its pending_list
> > is
> > empty before performing various other teardown tasks in
> > drm_sched_fini().
> >
> > If there are still jobs in the pending_list, this is problematic
> > because
> > after scheduler teardown, no one will call backend_ops.free_job()
> > anymore. This would, consequently, result in memory leaks.
> >
> > One way to solves this is to implement a waitqueue that
> > drm_sched_fini()
> > blocks on until the pending_list has become empty.
> >
> > Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once
> > the
> > pending_list becomes empty. Wait in drm_sched_fini() for that to
> > happen.
> >
> > Suggested-by: Danilo Krummrich <dakr@xxxxxxxxxx>
> > Signed-off-by: Philipp Stanner <pstanner@xxxxxxxxxx>
> > ---
> > Hi all,
> >
> > since the scheduler has many stake holders, I want this solution
> > discussed as an RFC first.
> >
> > This version here has IMO the advantage (and disadvantage...) that
> > it
> > blocks infinitly if the driver messed up the clean up, so problems
> > might become more visible than the refcount solution I proposed in
> > parallel.
>
> Very quick comment because I'm heading out for the r4l conference,
> but
> maybe I can discuss this there with Danilo a bit.
>
> Maybe we should do step 0 first, and document the current rules? The
> kerneldoc isn't absolute zero anymore, but it's very, very bare-
> bones.
> Then get that acked and merged, which is a very good way to make sure
> we're actually standing on common ground.
Yes, documentation is definitely also on my TODO list. I wanted to send
out something clarifying the objects' lifetimes (based on Christian's
previous series [1]) quite soon.
>
> Then maybe step 0.5 would be to add runtime asserts to enforce the
> rules,
> like if you tear down the scheduler and there's stuff in flight, you
> splat
> on a WARN_ON.
>
> With that foundation there should be a lot clearer basis to discuss
> whether there is an issue, and what a better design could look like.
I mean, I'm quite sure that there are teardown issues. But we could
indeed make them visible first through documentation (and a FIXME tag)
and after establishing consensus through merging that go on as you
suggested.
> The
> little pondering I've done I've come up with a few more ideas along
> similar lines.
>
> One comment below, kinda unrelated.
>
> >
> > Cheers,
> > P.
> > ---
> > drivers/gpu/drm/scheduler/sched_main.c | 40
> > ++++++++++++++++++++++++++
> > include/drm/gpu_scheduler.h | 4 +++
> > 2 files changed, 44 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index 7e90c9f95611..200fa932f289 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -564,6 +564,13 @@ static void drm_sched_job_timedout(struct
> > work_struct *work)
> > * is parked at which point it's safe.
> > */
> > list_del_init(&job->list);
> > +
> > + /*
> > + * Inform tasks blocking in drm_sched_fini() that
> > it's now safe to proceed.
> > + */
> > + if (list_empty(&sched->pending_list))
> > + wake_up(&sched->job_list_waitque);
> > +
> > spin_unlock(&sched->job_list_lock);
> >
> > status = job->sched->ops->timedout_job(job);
> > @@ -584,6 +591,15 @@ static void drm_sched_job_timedout(struct
> > work_struct *work)
> > drm_sched_start_timeout_unlocked(sched);
> > }
> >
> > +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler
> > *sched)
> > +{
> > + /*
> > + * For list_empty() to work without a lock.
> > + */
>
> So this is pretty far from the gold standard for documenting memory
> barrier semantics in lockless code. Ideally we have a comment for
> both
> sides of the barrier (you always need two, or there's no function
> barrier), pointing at each another, and explaining exactly what's
> being
> synchronized against what and how.
>
> I did years ago add a few missing barriers with that approach, see
> b0a5303d4e14 ("drm/sched: Barriers are needed for
> entity->last_scheduled"). Reading that patch again it feels a bit on
> the
> terse side of things (plus I noticed spelling issues now too, oops).
ACK, will do that properly once we role out the actual patch.
P.
>
> Cheers, Sima
>
> > + rmb();
> > + return list_empty(&sched->pending_list);
> > +}
> > +
> > /**
> > * drm_sched_stop - stop the scheduler
> > *
> > @@ -659,6 +675,12 @@ void drm_sched_stop(struct drm_gpu_scheduler
> > *sched, struct drm_sched_job *bad)
> > }
> > }
> >
> > + /*
> > + * Inform tasks blocking in drm_sched_fini() that it's now
> > safe to proceed.
> > + */
> > + if (drm_sched_no_jobs_pending(sched))
> > + wake_up(&sched->job_list_waitque);
> > +
> > /*
> > * Stop pending timer in flight as we rearm it in
> > drm_sched_start. This
> > * avoids the pending timeout work in progress to fire
> > right away after
> > @@ -1085,6 +1107,12 @@ drm_sched_get_finished_job(struct
> > drm_gpu_scheduler *sched)
> > /* remove job from pending_list */
> > list_del_init(&job->list);
> >
> > + /*
> > + * Inform tasks blocking in drm_sched_fini() that
> > it's now safe to proceed.
> > + */
> > + if (list_empty(&sched->pending_list))
> > + wake_up(&sched->job_list_waitque);
> > +
> > /* cancel this job's TO timer */
> > cancel_delayed_work(&sched->work_tdr);
> > /* make the scheduled timestamp more accurate */
> > @@ -1303,6 +1331,7 @@ int drm_sched_init(struct drm_gpu_scheduler
> > *sched,
> > init_waitqueue_head(&sched->job_scheduled);
> > INIT_LIST_HEAD(&sched->pending_list);
> > spin_lock_init(&sched->job_list_lock);
> > + init_waitqueue_head(&sched->job_list_waitque);
> > atomic_set(&sched->credit_count, 0);
> > INIT_DELAYED_WORK(&sched->work_tdr,
> > drm_sched_job_timedout);
> > INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > @@ -1333,12 +1362,23 @@ EXPORT_SYMBOL(drm_sched_init);
> > * @sched: scheduler instance
> > *
> > * Tears down and cleans up the scheduler.
> > + *
> > + * Note that this function blocks until the fences returned by
> > + * backend_ops.run_job() have been signalled.
> > */
> > void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > {
> > struct drm_sched_entity *s_entity;
> > int i;
> >
> > + /*
> > + * Jobs that have neither been scheduled or which have
> > timed out are
> > + * gone by now, but jobs that have been submitted through
> > + * backend_ops.run_job() and have not yet terminated are
> > still pending.
> > + *
> > + * Wait for the pending_list to become empty to avoid
> > leaking those jobs.
> > + */
> > + wait_event(sched->job_list_waitque,
> > drm_sched_no_jobs_pending(sched));
> > drm_sched_wqueue_stop(sched);
> >
> > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
> > i++) {
> > diff --git a/include/drm/gpu_scheduler.h
> > b/include/drm/gpu_scheduler.h
> > index 5acc64954a88..bff092784405 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -29,6 +29,7 @@
> > #include <linux/completion.h>
> > #include <linux/xarray.h>
> > #include <linux/workqueue.h>
> > +#include <linux/wait.h>
> >
> > #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
> >
> > @@ -503,6 +504,8 @@ struct drm_sched_backend_ops {
> > * timeout interval is over.
> > * @pending_list: the list of jobs which are currently in the job
> > queue.
> > * @job_list_lock: lock to protect the pending_list.
> > + * @job_list_waitque: a waitqueue for drm_sched_fini() to block on
> > until all
> > + * pending jobs have been finished.
> > * @hang_limit: once the hangs by a job crosses this limit then it
> > is marked
> > * guilty and it will no longer be considered for
> > scheduling.
> > * @score: score to help loadbalancer pick a idle sched
> > @@ -532,6 +535,7 @@ struct drm_gpu_scheduler {
> > struct delayed_work work_tdr;
> > struct list_head pending_list;
> > spinlock_t job_list_lock;
> > + wait_queue_head_t job_list_waitque;
> > int hang_limit;
> > atomic_t *score;
> > atomic_t _score;
> > --
> > 2.46.0
> >
>