Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer

From: Boris Brezillon

Date: Tue Mar 24 2026 - 05:34:37 EST


On Mon, 23 Mar 2026 11:38:06 -0700
Matthew Brost <matthew.brost@xxxxxxxxx> wrote:

>
> Ok, getting stats is easier than I thought...
>
> ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads --r threads-basic
>
> This test creates one thread per engine instance (7 instances on this
> BMG device) and submits 1k exec IOCTLs per thread, each performing a DW
> write. Each exec IOCTL typically has no unsignaled input dependencies.
>
> With IRQ putting of jobs off + no bypass (drm_dep_queue_flags = 0):
>
> 8,449 context-switches
> 412 cpu-migrations
> 2,531.43 msec task-clock
> 1,847,846,588 cpu_atom/cycles/
> 1,847,856,947 cpu_core/cycles/
> <not supported> cpu_atom/instructions/
> 460,744,020 cpu_core/instructions/
>
> With IRQ putting of jobs off + bypass (drm_dep_queue_flags =
> DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
>
> 8,655 context-switches
> 229 cpu-migrations
> 2,571.33 msec task-clock
> 855,900,607 cpu_atom/cycles/
> 855,900,272 cpu_core/cycles/
> <not supported> cpu_atom/instructions/
> 403,651,469 cpu_core/instructions/
>
> With IRQ putting of jobs on + bypass (drm_dep_queue_flags =
> DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
> DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
>
> 5,361 context-switches
> 169 cpu-migrations
> 2,577.44 msec task-clock
> 685,769,153 cpu_atom/cycles/
> 685,768,407 cpu_core/cycles/
> <not supported> cpu_atom/instructions/
> 321,336,297 cpu_core/instructions/

Thanks for sharing those numbers. For completeness, can you also add the
"With IRQ putting of jobs on + no bypass" case?

I'm a bit surprised by the difference in the number of context switches,
given I'd expect the local CPU to be picked preferentially, and thus
queuing work items on the same wq from another work item to be almost
free in terms of scheduling. But I guess there's some load-balancing
happening when you execute jobs at such a high rate.
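
To be clear, the pattern I had in mind is the self-requeue one, roughly
like below (only the workqueue API is real here, the dep_queue names are
made up for the example):

#include <linux/workqueue.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct dep_queue {
	struct workqueue_struct *wq;
	struct work_struct run_work;
	struct list_head jobs;
	spinlock_t lock;
};

static void dep_queue_run_job(struct work_struct *w)
{
	struct dep_queue *q = container_of(w, struct dep_queue, run_work);
	bool more;

	/* ... pop one job off q->jobs and run it ... */

	spin_lock(&q->lock);
	more = !list_empty(&q->jobs);
	spin_unlock(&q->lock);

	/*
	 * Re-queue ourselves on the same wq. On a bound (percpu) wq,
	 * queue_work() targets the local CPU, so this alone shouldn't
	 * cause migrations or extra context switches.
	 */
	if (more)
		queue_work(q->wq, &q->run_work);
}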

Also, I don't know if that's just noise or if it's reproducible, but
task-clock seems to be ~40 msec lower in the deferred cleanup +
no-bypass case (higher throughput because you're not blocking the
dequeuing of the next job on the cleanup of the previous one, I
suspect).
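
And just so we're on the same page about how those flag combinations
would be picked by a driver, here is roughly what I have in mind
(illustrative only: drm_dep_queue_init(), struct xe_queue and
free_job_is_irq_safe() are hypothetical names, only the two flag values
come from your series):

static int xe_queue_setup_dep_queue(struct xe_queue *q)
{
	unsigned long flags = 0;

	/* Skip the queue entirely when a job has no unsignaled deps. */
	flags |= DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED;

	/* Only if the driver's job put/free path is IRQ safe. */
	if (free_job_is_irq_safe(q))
		flags |= DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE;

	return drm_dep_queue_init(&q->dep_queue, flags);
}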