Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer

From: Matthew Brost

Date: Tue Mar 17 2026 - 01:22:53 EST


On Mon, Mar 16, 2026 at 10:16:01AM +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> On Sun, 15 Mar 2026 21:32:45 -0700
> Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
>
> > Diverging requirements between GPU drivers using firmware scheduling
> > and those using hardware scheduling have shown that drm_gpu_scheduler is
> > no longer sufficient for firmware-scheduled GPU drivers. The technical
> > debt, lack of memory-safety guarantees, absence of clear object-lifetime
> > rules, and numerous driver-specific hacks have rendered
> > drm_gpu_scheduler unmaintainable. It is time for a fresh design for
> > firmware-scheduled GPU drivers—one that addresses all of the
> > aforementioned shortcomings.
> >
> > Add drm_dep, a lightweight GPU submission queue intended as a
> > replacement for drm_gpu_scheduler for firmware-managed GPU schedulers
> > (e.g. Xe, Panthor, AMDXDNA, PVR, Nouveau, Nova). Unlike
> > drm_gpu_scheduler, which separates the scheduler (drm_gpu_scheduler)
> > from the queue (drm_sched_entity) into two objects requiring external
> > coordination, drm_dep merges both roles into a single struct
> > drm_dep_queue. This eliminates the N:1 entity-to-scheduler mapping
> > that is unnecessary for firmware schedulers which manage their own
> > run-lists internally.
> >
> > Unlike drm_gpu_scheduler, which relies on external locking and lifetime
> > management by the driver, drm_dep uses reference counting (kref) on both
> > queues and jobs to guarantee object lifetime safety. A job holds a queue
> > reference from init until its last put, and the queue holds a job reference
> > from dispatch until the put_job worker runs. This makes use-after-free
> > impossible even when completion arrives from IRQ context or concurrent
> > teardown is in flight.
> >
> > The core objects are:
> >
> > struct drm_dep_queue - a per-context submission queue owning an
> > ordered submit workqueue, a TDR timeout workqueue, an SPSC job
> > queue, and a pending-job list. Reference counted; drivers can embed
> > it and provide a .release vfunc for RCU-safe teardown.
>
> First off, I like this idea, and actually think we should have done that
> from the start rather than trying to bend drm_sched to meet our

Yes. Tvrtko actually suggested this years ago, and in my naïveté I
rejected it. I’m eating my hat here.

> FW-assisted scheduling model. That's also the direction Danilo and I
> have been pushing for the new JobQueue stuff in Rust, so I'm glad
> to see some consensus here.
>
> Now, let's start with the usual naming nitpick :D => can't we find a
> better prefix than "drm_dep"? I think I get where "dep" comes from (the
> logic mostly takes care of job deps, and acts as a FIFO otherwise, no
> real scheduling involved). It's kinda okay for drm_dep_queue, even
> though, according to the description you've made, jobs seem to stay in
> that queue even after their deps are met, which, IMHO, is a bit
> confusing: dep_queue sounds like a queue in which jobs are placed until
> their deps are met, and then the job moves to some other queue.
>
> It gets worse for drm_dep_job, which sounds like a dep-only job, rather
> than a job that's queued to the drm_dep_queue. Same goes for
> drm_dep_fence, which I find super confusing. What this one does is just
> proxy the driver fence to provide proper isolation between GPU drivers
> and fence observers (other drivers).
>
> Since this new model is primarily designed for hardware that has
> FW-assisted scheduling, how about drm_fw_queue, drm_fw_job,
> drm_fw_job_fence?

We can bikeshed — I’m open to other names, but I believe hardware
scheduling can be built quite cleanly on top of this, so drm_fw_*
doesn’t really work either. Check out a hardware-scheduler PoC built
(today) on top of this in [1].

[1] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/commit/22c8aa993b5c9e4ad0c312af2f3e032273d20966

>
> >
> > struct drm_dep_job - a single unit of GPU work. Drivers embed this
> > and provide a .release vfunc. Jobs carry an xarray of input
> > dma_fence dependencies and produce a drm_dep_fence as their
> > finished fence.
> >
> > struct drm_dep_fence - a dma_fence subclass wrapping an optional
> > parent hardware fence. The finished fence is armed (sequence
> > number assigned) before submission and signals when the hardware
> > fence signals (or immediately on synchronous completion).
> >
> > Job lifecycle:
> > 1. drm_dep_job_init() - allocate and initialise; job acquires a
> > queue reference.
> > 2. drm_dep_job_add_dependency() and friends - register input fences;
> > duplicates from the same context are deduplicated.
> > 3. drm_dep_job_arm() - assign sequence number, obtain finished fence.
> > 4. drm_dep_job_push() - submit to queue.
> >
> > Submission paths under queue lock:
> > - Bypass path: if DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED is set, the
> > SPSC queue is empty, no dependencies are pending, and credits are
> > available, the job is dispatched inline on the calling thread.
>
> I've yet to look at the code, but I must admit I'm less worried about
> this fast path if it's part of a new model restricted to FW-assisted
> scheduling. I keep thinking we're not entirely covered for so-called
> real-time GPU contexts that might have jobs that are not dep-free, and
> if we're going for something new, I'd really like us to consider that
> case from the start (maybe investigate if kthread_work[er] can be used
> as a replacement for workqueues, if RT priority on workqueues is not an
> option).
>

I mostly agree, and I’ll look into whether kthread_work is better
suited—if that’s the right model, it should be done up front.

But can you give a use case for real-time GPU contexts that are not
dep-free? I personally don’t know of one.

> > - Queued path: job is pushed onto the SPSC queue and the run_job
> > worker is kicked. The worker resolves remaining dependencies
> > (installing wakeup callbacks for unresolved fences) before calling
> > ops->run_job().
> >
> > Credit-based throttling prevents hardware overflow: each job declares
> > a credit cost at init time; dispatch is deferred until sufficient
> > credits are available.
> >
> > Timeout Detection and Recovery (TDR): a per-queue delayed work item
> > fires when the head pending job exceeds q->job.timeout jiffies, calling
> > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate
> > expiry for device teardown.
> >
> > IRQ-safe completion: queues flagged DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE
> > allow drm_dep_job_done() to be called from hardirq context (e.g. a
> > dma_fence callback). Dependency cleanup is deferred to process context
> > after ops->run_job() returns to avoid calling xa_destroy() from IRQ.
> >
> > Zombie-state guard: workers use kref_get_unless_zero() on entry and
> > bail immediately if the queue refcount has already reached zero and
> > async teardown is in flight, preventing use-after-free.
> >
> > Teardown is always deferred to a module-private workqueue (dep_free_wq)
> > so that destroy_workqueue() is never called from within one of the
> > queue's own workers. Each queue holds a drm_dev_get() reference on its
> > owning struct drm_device, released as the final step of teardown via
> > drm_dev_put(). This prevents the driver module from being unloaded
> > while any queue is still alive without requiring a separate drain API.
>
> Thanks for posting this RFC. I'll try to have a closer look at the code
> in the coming days, but given the diffstat, it might take me a bit of
> time...


I understand — I’m a firehose when I get started. Hopefully a sane one,
though.

Matt

>
> Regards,
>
> Boris