Re: sched/fair: DELAY_DEQUEUE causes ~25% pipe IPC regression on Raspberry Pi 5

From: Tom Gebhardt

Date: Sat Apr 18 2026 - 12:13:10 EST


Hi Vincent,

Thank you for the questions; here are the results.

Hardware: Raspberry Pi 5 (BCM2712 C1), 8 GB, arm64, overclocked (arm_freq=2800).
Benchmark: stress-ng 0.15.06, --pipe 4 --timeout 20s
Measured with: perf stat -e sched:sched_migrate_task,sched:sched_switch
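For reproducibility, the runs were driven roughly as follows (a sketch of my setup; the system-wide `-a` flag and root privileges for the tracepoints are assumptions about what your environment may need):

```shell
# Hedged sketch of the measurement setup described above; assumes
# stress-ng and perf are installed and perf can access tracepoints.
perf stat -e sched:sched_migrate_task,sched:sched_switch -a -- \
    stress-ng --pipe 4 --timeout 20s --metrics-brief
```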

Results
-------

Kernel    pipe bogo ops/s   sched_migrate_task   sched_switch
6.6.78    2 339 827         98                   21 151 454
6.12.81   1 163 540         4                    39 911 525

Key observations:

1. sched_migrate_task is much lower in 6.12 (4 vs 98). Producer and
consumer are not migrating — they stay on their assigned CPUs.
CPU affinity sampling confirms that workers are distributed across
all 4 cores in both kernels with no significant difference in
co-location.

2. sched_switch is nearly double in 6.12 (+89%, 39.9M vs 21.2M). The
pipe worker pairs switch context far more often without doing more
useful work.

3. usr time drops significantly in 6.12 (11.7s vs 20.3s user time
   across 4 workers), while sys time increases (67.4s vs 57.7s). The
   workers spend less time in userspace and more time in the kernel
   per iteration.

This points to DELAY_DEQUEUE causing the sleeping consumer to remain
on the runqueue longer, generating additional scheduling overhead per
pipe iteration without improving throughput — consistent with the ~50%
regression in bogo ops/s.
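For anyone wanting to reproduce the feature isolation from my earlier mail, the runtime toggle I used looks roughly like this (a sketch; it requires root, CONFIG_SCHED_DEBUG=y, and debugfs mounted at the usual path):

```shell
# Hedged sketch of the DELAY_DEQUEUE runtime toggle; requires root,
# CONFIG_SCHED_DEBUG=y, and debugfs mounted at /sys/kernel/debug.
FEAT=/sys/kernel/debug/sched/features

tr ' ' '\n' < "$FEAT" | grep DELAY               # show current state
echo NO_DELAY_DEQUEUE > "$FEAT"                  # disable delayed dequeue
stress-ng --pipe 4 --timeout 20s --metrics-brief # re-run the benchmark
echo DELAY_DEQUEUE > "$FEAT"                     # restore the default
```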

Happy to run additional traces (e.g. perf sched or ftrace) if that would help.
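For reference, the follow-up tracing would look roughly like this (a sketch; assumes perf is built with scheduler support and run as root):

```shell
# Hedged sketch of additional scheduler tracing; requires root and a
# perf build with "perf sched" support.
perf sched record -- stress-ng --pipe 4 --timeout 20s
perf sched latency --sort max    # per-task wakeup/scheduling latency
perf sched map                   # CPU placement of the pipe workers
```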

Best regards,
Tom

On Fri, Apr 17, 2026 at 17:42, Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
>
> On Thu, 16 Apr 2026 at 14:23, Tom Gebhardt <tomge68@xxxxxxxxx> wrote:
> >
> > Hi Peter,
> >
> > I would like to report a measurable pipe IPC throughput regression introduced by
> > commit 152e11f ("sched/fair: Implement delayed dequeue"), first
> > present in v6.12-rc1.
> >
> > This has been independently confirmed on the official Raspberry Pi
> > Linux issue tracker (raspberrypi/linux #7308), where the RPi kernel
> > team directed the issue upstream.
> >
> >
> > Hardware / Software
> > -------------------
> > - Raspberry Pi 5 Model B, BCM2712 (C1 stepping), 8 GB RAM
> > - Raspberry Pi OS Bookworm (arm64)
> > - Kernels tested: 6.6.78-v8-16k+ (rpi-6.6.y), 6.12.75+rpt-rpi-2712,
> > 6.12.81-v8-16k+ (custom)
> > - Benchmark: stress-ng 0.15.06, --pipe 4 --timeout 20s --metrics-brief
> >
> >
> > Observed regression
> > -------------------
> > Comparing pipe IPC throughput across kernels (overclocked, arm_freq=2800):
> >
> > Kernel    pipe bogo ops/s   vs. 6.6
> > 6.6.78    2 487 746         100%
> > 6.12.75   1 651 427         -34%
> > 6.18.21   2 049 701         -18%
> >
> > This regression pattern is consistent across two separate Raspberry
> > Pi 5 units and has been independently reproduced by the RPi kernel
> > team with 20-run averages:
> > 6.6=2065 Kops/s, 6.12=1662, 6.18=1805, 7.0=1570 (lowest).
> >
> >
> > Runtime isolation via CONFIG_SCHED_DEBUG
> > -----------------------------------------
> > To isolate the root cause, I compiled a custom kernel (rpi-6.12.y,
> > 6.12.81-v8-16k+)
> > with CONFIG_SCHED_DEBUG=y and toggled scheduler features at runtime:
> >
> > DELAY_DEQUEUE   PREEMPT_SHORT   pipe bogo ops/s   vs. baseline
> > on (default)    on (default)    1 506 572         --
> > OFF             on              2 125 473         +41%   <==
> > on              OFF             1 419 026         -6%
> > OFF             OFF             2 078 182         +38%
> >
> > Disabling DELAY_DEQUEUE alone recovers +41% throughput, almost
> > closing the gap to 6.6.
> > Disabling PREEMPT_SHORT alone has no positive effect on this workload.
> >
> > The remaining gap to 6.6 (~15%) is likely CONFIG_SCHED_DEBUG=y overhead.
> >
> >
> > Root cause analysis
> > -------------------
> > The pipe producer-consumer loop is affected by DELAY_DEQUEUE as follows:
> >
> > Before DELAY_DEQUEUE:
> > consumer reads empty pipe -> blocks -> dequeue_task() removes it from runqueue
> > producer writes -> wake_up_interruptible() -> consumer re-enqueued
> > cleanly -> runs
> >
> > With DELAY_DEQUEUE (v6.12+):
> > consumer reads empty pipe -> blocks -> stays on runqueue (sched_delayed = 1)
> > producer writes -> wakeup path handles already-queued task ->
> > additional bookkeeping
>
> because the task is still enqueued, the enqueue of a delayed entity
> should be faster most of the time
>
> > per iteration
> >
> > For a tight 4-worker pipe benchmark at millions of iterations, this
> > per-iteration overhead compounds directly into measured throughput.
> >
> > PREEMPT_SHORT (commit 85e511d) does not contribute to this regression.
> > Its stated
>
> Unless you set a custom slice you should not see any difference with this patch
>
> > trade-off ("massive_intr workload gets more context switches") does
> > not appear to be the bottleneck here.
> >
> >
> > Mitigations tested and ruled out
> > ---------------------------------
> > - Spectre mitigations: mitigations=off yields only a +0.5-2.5%
> >   improvement (confirmed by the RPi kernel team). Not the cause.
> > - CPU governor: tested with both ondemand and performance. No
> >   significant difference.
> >
> >
> > References
> > ----------
> > - Commit 152e11f (DELAY_DEQUEUE):
> > https://github.com/torvalds/linux/commit/152e11f6df293e816a6a37c69757033cdc72667d
> > - Commit 85e511d (PREEMPT_SHORT):
> > https://github.com/torvalds/linux/commit/85e511df3cec46021024176672a748008ed135bf
> > - RPi issue tracker: https://github.com/raspberrypi/linux/issues/7308
> >
> > Please let me know if additional data (perf traces, full benchmark
> > logs, kernel config) would be helpful. I am happy to run further
> > tests on the hardware.
>
> Could you trace and check that:
> - consumer and producer of one pipe are on the same cpu
> - if there is a diff in the number of migration
>
> I will try to reproduce locally once I get access to my hardware
>
> >
> > Thank you for your work on the scheduler.
> >
> > Best regards,
> > Thomas Gebhardt (@Kletternaut on GitHub)
> >