Re: [PATCH v2 06/11] drm/panthor: Prepare the scheduler logic for FW events in IRQ context

From: Boris Brezillon

Date: Mon Jun 22 2026 - 08:51:12 EST

On Wed, 20 May 2026 15:15:54 -0700
Chia-I Wu <olvaffe@xxxxxxxxx> wrote:

> > > > I collected
> > > > some numbers with baseline, with this series, and with patch 9
> > > > reverted at https://gitlab.freedesktop.org/panfrost/linux/-/work_items/85#note_3481308.
> > > > Reposting the numbers here for reference
> > > >
> > > > | | baseline | entire series | patch 9 reverted |
> > > > | - | - | - | - |
> > > > | frag job median | 2.8ms | 2.2ms | 2.2ms |
> > > > | frag job 95% | 4.5ms | 2.8ms | 2.8ms |
> > > > | frag job 99% | 4.9ms | 2.8ms | 2.8ms |
> > > > | panthor-job median | 0.8us | 6.2us | 0.9us |
> > > > | panthor-job 95% | 1.5us | 16.6us | 1.5us |
> > > > | panthor-job 99% | 1.6us | 28.0us | 1.8us |
> > >
> > > panthor-job rows are the durations of the raw irq handlers, collected
> > > from irq/irq_handler_{entry,exit}.
> > >
> > > frag job rows are the durations from frag jobs, collected from
> > > gpu_scheduler/drm_sched_job_{run,done}.
> > >
> > > The fence signaling paths of them are
> > >
> > > - baseline: raw handler -> rt threaded handler -> wq job -> wq job ->
> > > fence signal
> > > - entire series: raw handler -> fence signal
> > > - patch 9 reverted: raw handler -> rt threaded handler -> fence signal
> >
> > Just did another set of throughput tests, and I confirm the gains are
> > noticeable only with patch 9 applied (that's on rk3588, which embeds a
> > G610, so not the exact same setup). As an example, on
> > gfxbench/gl_manhattan, I get the following score bump 2391 -> 2457.
> >
> > Now I need to set things up to measure latency like you did and make
> > sure I'm observing the same thing: threaded handlers providing roughly
> > the same latency as hardirq handlers. If not it probably has to do with
> > some config options that differ and change the preemptability of the
> > system.
> >
> > I'll hold off on the submission of v3 until this is done, because if
> > threaded handlers are roughly as efficient as hardirq ones, we probably
> > want to stick to threaded handlers.

Sorry for the delay, I only got back to this on Friday.

So, I've been using ftrace/function-graph, with a few noinline
annotations added, to get a sense of where most of the time is spent in
the hardirq handler after the transition to hardirqs. Contrary to what
I expected, it's not coming from the accesses to uncached mappings of
the FW interface/syncobjs, but from the various queue[_delayed]_work()
calls and/or the wake_up_all() on panthor_fw::req_waitqueue. I don't
expect us to
be able to optimize that anytime soon, so I guess we should just keep
everything in the threaded handler for now and accept the extra delay
(assuming 20+ usec for the hardirq handler is too long). This also
means that a lot of the things I do in this series are moot
(irqsave/restore, using spinlocks instead of mutexes, ...), but before
I go and rework that, I'd like to get some feedback from Steve and
Liviu to make sure this is okay with Arm.
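
In case it helps anyone reproduce the panthor-job rows quoted above,
here is a minimal sketch of how handler durations can be derived from
irq/irq_handler_{entry,exit} events. It assumes the default ftrace text
output (CPU in brackets, timestamp in seconds, "irq=N name=..." on
entry). The function names are made up for illustration, and this is
not the script that produced the numbers in this thread.

```python
import math
import re

# Assumed line shape (default ftrace text output):
#   <idle>-0  [002] d.h1.  1234.567890: irq_handler_entry: irq=45 name=panthor-job
#   <idle>-0  [002] d.h1.  1234.567896: irq_handler_exit: irq=45 ret=handled
EVENT_RE = re.compile(
    r"\[(?P<cpu>\d+)\].*?\s(?P<ts>\d+\.\d+):\s+"
    r"(?P<event>irq_handler_entry|irq_handler_exit):\s+irq=(?P<irq>\d+)"
    r"(?:\s+name=(?P<name>\S+))?"
)


def handler_durations_us(lines, irq_name):
    """Return hardirq handler durations (in usec) for the given IRQ name."""
    pending = {}    # (cpu, irq number) -> entry timestamp in seconds
    durations = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue
        key = (int(m["cpu"]), int(m["irq"]))
        ts = float(m["ts"])
        if m["event"] == "irq_handler_entry":
            # Only track entries for the handler we care about.
            if m["name"] == irq_name:
                pending[key] = ts
        else:
            # Exit events carry no name, so match them by (cpu, irq).
            start = pending.pop(key, None)
            if start is not None:
                durations.append((ts - start) * 1e6)
    return durations


def percentile(values, pct):
    """Nearest-rank percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Feeding it the lines of a trace captured with those two events enabled,
then calling percentile() with 50, 95 and 99, gives the same kind of
rows as in the table above. Entry and exit are paired per CPU because a
hardirq handler starts and finishes on the same CPU.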