Re: sched/fair: DELAY_DEQUEUE causes ~25% pipe IPC regression on Raspberry Pi 5
From: Vincent Guittot
Date: Mon Apr 20 2026 - 11:31:37 EST
Hi Tom,
On Sat, 18 Apr 2026 at 18:13, Tom Gebhardt <tomge68@xxxxxxxxx> wrote:
>
> Hi Vincent,
>
> thank you for the questions — here are the results.
>
> Hardware: Raspberry Pi 5 (BCM2712 C1), 8 GB, arm64, overclocked (arm_freq=2800).
> Benchmark: stress-ng 0.15.06, --pipe 4 --timeout 20s
> Measured with: perf stat -e sched:sched_migrate_task,sched:sched_switch
>
> Results
> -------
>
> Kernel     pipe bogo ops/s   sched_migrate_task   sched_switch
> 6.6.78         2 339 827             98            21 151 454
> 6.12.81        1 163 540              4            39 911 525
>
> Key observations:
>
> 1. sched_migrate_task is much lower in 6.12 (4 vs 98). Producer and
> consumer are not migrating — they stay on their assigned CPUs.
> CPU affinity sampling confirms that workers are distributed across
> all 4 cores in both kernels with no significant difference in
> co-location.
This is somewhat expected: delayed-dequeue tasks are requeued on their
prev rq at wakeup without calling select_task_rq(), so almost no
migrations are recorded.
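If you want to double-check per-task placement without perf, the CPU a
task last ran on can be sampled from /proc (a quick sketch, not a
scheduler API; field numbering per proc(5), the helper name is mine):

```python
# Sample which CPU a task last ran on, from /proc/<pid>/stat.
# Sketch only; "processor" is field 39 per proc(5).
import os

def last_cpu(pid):
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # comm (field 2) may contain spaces or parens; split after its
    # closing paren, so fields[0] is field 3 (state) and field 39
    # ("processor") lands at fields[36]
    fields = stat.rsplit(")", 1)[1].split()
    return int(fields[36])

if __name__ == "__main__":
    print(last_cpu(os.getpid()))
```

Polling this for each stress-ng worker pid during a run would show
whether a producer/consumer pair ever leaves its CPU.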
>
> 2. sched_switch is nearly double in 6.12 (+89%, 39.9M vs 21.2M). The
> pipe worker pairs switch context far more often without doing more
> useful work.
This is most probably our main problem. I will check why we have so
many context switches with delayed dequeues.
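To put a number on switches per pipe iteration without a full perf run,
the per-task counters in /proc/<pid>/status can be read before and after
a benchmark (a sketch; counter names per proc(5), the helper name is
mine):

```python
# Read a task's context-switch counters from /proc/<pid>/status.
# Sketch only; counter names per proc(5).
import os

def ctxt_switches(pid):
    counters = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("voluntary_ctxt_switches",
                                "nonvoluntary_ctxt_switches")):
                name, value = line.split(":")
                counters[name] = int(value)
    return counters

if __name__ == "__main__":
    print(ctxt_switches(os.getpid()))
```

Dividing the delta in voluntary switches by bogo ops would give
switches per iteration for each kernel.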
I think I have been able to reproduce this: a 19% regression on my
board for 6.12.82 vs 6.6.185, with 3 times more context switches.
The drop is only 7% with tip/sched/core.
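In case it helps others bisect without stress-ng, the blocking round
trip the benchmark exercises reduces to a few lines (a minimal sketch,
not stress-ng itself; each read on an empty pipe is the sleep/wakeup
cycle DELAY_DEQUEUE changes):

```python
# Minimal pipe ping-pong: parent and child block on empty pipes each
# iteration, the same sleep/wakeup pattern stress-ng --pipe stresses.
import os

def pipe_ping_pong(iters):
    ping_r, ping_w = os.pipe()
    pong_r, pong_w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: consumer
        for _ in range(iters):
            os.read(ping_r, 1)        # blocks on empty pipe
            os.write(pong_w, b"a")    # wakes the producer
        os._exit(0)
    done = 0
    for _ in range(iters):            # parent: producer
        os.write(ping_w, b"p")        # wakes the consumer
        os.read(pong_r, 1)            # blocks until the ack
        done += 1
    os.waitpid(pid, 0)
    return done

if __name__ == "__main__":
    import time
    t0 = time.monotonic()
    n = pipe_ping_pong(100_000)
    print(n / (time.monotonic() - t0), "round trips/s")
```

Timing this loop with DELAY_DEQUEUE toggled on and off should show the
same pattern as the stress-ng numbers.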
Regards,
Vincent
>
> 3. usr time drops significantly in 6.12 (11.7s vs 20.3s user time
> across 4 workers), while sys time increases (67.4s vs 57.7s). The
> workers spend less time in userspace and more in the kernel per
> iteration.
>
> This points to DELAY_DEQUEUE causing the sleeping consumer to remain
> on the runqueue longer, generating additional scheduling overhead per
> pipe iteration without improving throughput — consistent with the ~50%
> regression in bogo ops/s.
>
> Happy to run additional traces (e.g. perf sched or ftrace) if that would help.
>
> Best regards,
> Tom
>
> On Fri, 17 Apr 2026 at 17:42, Vincent Guittot
> <vincent.guittot@xxxxxxxxxx> wrote:
> >
> > On Thu, 16 Apr 2026 at 14:23, Tom Gebhardt <tomge68@xxxxxxxxx> wrote:
> > >
> > > Hi Peter,
> > >
> > > I would like to report a measurable pipe IPC throughput regression introduced by
> > > commit 152e11f ("sched/fair: Implement delayed dequeue"), first
> > > present in v6.12-rc1.
> > >
> > > This has been independently confirmed on the official Raspberry Pi
> > > Linux issue tracker
> > > (raspberrypi/linux #7308), where the RPi kernel team directed the
> > > issue upstream.
> > >
> > >
> > > Hardware / Software
> > > -------------------
> > > - Raspberry Pi 5 Model B, BCM2712 (C1 stepping), 8 GB RAM
> > > - Raspberry Pi OS Bookworm (arm64)
> > > - Kernels tested: 6.6.78-v8-16k+ (rpi-6.6.y), 6.12.75+rpt-rpi-2712,
> > > 6.12.81-v8-16k+ (custom)
> > > - Benchmark: stress-ng 0.15.06, --pipe 4 --timeout 20s --metrics-brief
> > >
> > >
> > > Observed regression
> > > -------------------
> > > Comparing pipe IPC throughput across kernels (overclocked, arm_freq=2800):
> > >
> > > Kernel     pipe bogo ops/s   vs. 6.6
> > > 6.6.78         2 487 746       100%
> > > 6.12.75        1 651 427       -34%
> > > 6.18.21        2 049 701       -18%
> > >
> > > This regression pattern is consistent across two separate Raspberry
> > > Pi 5 units and has been independently reproduced by the RPi kernel
> > > team with 20-run averages: 6.6 = 2065 Kops/s, 6.12 = 1662,
> > > 6.18 = 1805, 7.0 = 1570 (lowest).
> > >
> > >
> > > Runtime isolation via CONFIG_SCHED_DEBUG
> > > -----------------------------------------
> > > To isolate the root cause, I compiled a custom kernel (rpi-6.12.y,
> > > 6.12.81-v8-16k+) with CONFIG_SCHED_DEBUG=y and toggled scheduler
> > > features at runtime:
> > >
> > > DELAY_DEQUEUE   PREEMPT_SHORT   pipe bogo ops/s   vs. baseline
> > > on (default)    on (default)        1 506 572        --
> > > OFF             on                  2 125 473       +41%  <==
> > > on              OFF                 1 419 026        -6%
> > > OFF             OFF                 2 078 182       +38%
> > >
> > > Disabling DELAY_DEQUEUE alone recovers +41% throughput, almost closing
> > > the gap to 6.6.
> > > Disabling PREEMPT_SHORT alone has no positive effect on this workload.
> > >
> > > The remaining gap to 6.6 (~15%) is likely CONFIG_SCHED_DEBUG=y overhead.
> > >
> > >
> > > Root cause analysis
> > > -------------------
> > > The pipe producer-consumer loop is affected by DELAY_DEQUEUE as follows:
> > >
> > > Before DELAY_DEQUEUE:
> > > consumer reads empty pipe -> blocks -> dequeue_task() removes it from the runqueue
> > > producer writes -> wake_up_interruptible() -> consumer re-enqueued cleanly -> runs
> > >
> > > With DELAY_DEQUEUE (v6.12+):
> > > consumer reads empty pipe -> blocks -> stays on runqueue (sched_delayed = 1)
> > > producer writes -> wakeup path handles the already-queued task ->
> > > additional bookkeeping per iteration
> >
> > because the task is still enqueued, the enqueue of a delayed entity
> > should be faster most of the time
> >
> > >
> > > For a tight 4-worker pipe benchmark at millions of iterations, this
> > > per-iteration overhead compounds directly into measured throughput.
> > >
> > > PREEMPT_SHORT (commit 85e511d) does not contribute to this regression.
> >
> > Unless you set a custom slice, you should not see any difference with this patch.
> >
> > > Its stated trade-off ("massive_intr workload gets more context
> > > switches") does not appear to be the bottleneck here.
> > >
> > >
> > > Mitigations tested and ruled out
> > > ---------------------------------
> > > - Spectre mitigations: mitigations=off yields only +0.5-2.5%
> > >   improvement (confirmed by the RPi kernel team). Not the cause.
> > > - CPU governor: tested with both ondemand and performance. No
> > >   significant difference.
> > >
> > >
> > > References
> > > ----------
> > > - Commit 152e11f (DELAY_DEQUEUE):
> > > https://github.com/torvalds/linux/commit/152e11f6df293e816a6a37c69757033cdc72667d
> > > - Commit 85e511d (PREEMPT_SHORT):
> > > https://github.com/torvalds/linux/commit/85e511df3cec46021024176672a748008ed135bf
> > > - RPi issue tracker: https://github.com/raspberrypi/linux/issues/7308
> > >
> > > Please let me know if additional data (perf traces, full benchmark
> > > logs, kernel config) would be helpful. I am happy to run further
> > > tests on the hardware.
> >
> > Could you trace and check:
> > - whether the consumer and producer of one pipe are on the same CPU
> > - whether there is a difference in the number of migrations
> >
> > I will try to reproduce locally once I get access to my hardware.
> >
> > >
> > > Thank you for your work on the scheduler.
> > >
> > > Best regards,
> > > Thomas Gebhardt (@Kletternaut on GitHub)
> > >