Re: drm_sched run_job and scheduling latency

From: Philipp Stanner

Date: Thu Mar 05 2026 - 03:42:29 EST


On Thu, 2026-03-05 at 09:27 +0100, Boris Brezillon wrote:
> Hi Matthew,
>
> On Wed, 4 Mar 2026 18:04:25 -0800
> Matthew Brost <matthew.brost@xxxxxxxxx> wrote:
>
> > On Wed, Mar 04, 2026 at 02:51:39PM -0800, Chia-I Wu wrote:
> > > Hi,
> > >
> > > Our system compositor (surfaceflinger on android) submits gpu jobs
> > > from a SCHED_FIFO thread to an RT gpu queue. However, because
> > > workqueue threads are SCHED_NORMAL, the scheduling latency from submit
> > > to run_job can sometimes cause frame misses. We are seeing this on
> > > panthor and xe, but the issue should be common to all drm_sched users.
> > >  
> >
> > I'm going to assume that since this is a compositor, you do not pass
> > input dependencies to the page-flip job. Is that correct?
> >
> > If so, I believe we could fairly easily build an opt-in DRM sched path
> > that directly calls run_job in the exec IOCTL context (I assume this is
> > SCHED_FIFO) if the job has no dependencies.
>
> I guess by ::run_job() you mean something slightly more involved, which
> checks whether:
>
> - other jobs are pending
> - enough credits (AKA ringbuf space) is available
> - and probably other stuff I forgot about
>
> >
> > This would likely break some of Xe’s submission-backend assumptions
> > around mutual exclusion and ordering based on the workqueue, but that
> > seems workable. I don’t know how the Panthor code is structured or
> > whether they have similar issues.
>
> Honestly, I'm not thrilled by this fast-path/call-run_job-directly idea
> you're describing. There are just so many things we could forget that
> would lead to races/ordering issues which will end up being hard to
> trigger and debug.
>

+1

I'm not thrilled either. More like the opposite of thrilled, actually.

Even if we could get that to work, this is more of a maintainability
issue.

The scheduler is already full of insane performance hacks for this or
that driver: lockless accesses, a special lockless queue used by only
one party in the kernel (a lockless queue which is nowadays, after N
reworks, used with a lock. Ah well).

In past discussions, Danilo and I made it clear that major features in
_new_ patch series aimed at getting merged into drm/sched must be
preceded by cleanup work addressing some of the scheduler's major
problems.

That's especially true for features aimed at performance gains.



P.