sched/fair: DELAY_DEQUEUE causes ~25% pipe IPC regression on Raspberry Pi 5
From: Tom Gebhardt
Date: Thu Apr 16 2026 - 08:22:59 EST
Hi Peter,
I would like to report a measurable pipe IPC throughput regression introduced by
commit 152e11f ("sched/fair: Implement delayed dequeue"), first
present in v6.12-rc1.
This has been independently confirmed on the official Raspberry Pi Linux issue
tracker (raspberrypi/linux #7308), where the RPi kernel team directed the
issue upstream.
Hardware / Software
-------------------
- Raspberry Pi 5 Model B, BCM2712 (C1 stepping), 8 GB RAM
- Raspberry Pi OS Bookworm (arm64)
- Kernels tested: 6.6.78-v8-16k+ (rpi-6.6.y), 6.12.75+rpt-rpi-2712,
6.12.81-v8-16k+ (custom)
- Benchmark: stress-ng 0.15.06, --pipe 4 --timeout 20s --metrics-brief
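For completeness, the measurement can be scripted; this is a minimal sketch
that assumes stress-ng is on PATH and simply filters the per-stressor metrics
line (the guard and the grep are my additions, not part of the benchmark):

```shell
# Run the pipe IPC benchmark used throughout this report and show the
# per-stressor metrics line; degrades gracefully if stress-ng is missing.
if command -v stress-ng >/dev/null 2>&1; then
    stress-ng --pipe 4 --timeout 20s --metrics-brief 2>&1 | grep ' pipe '
    status=$?
else
    echo "stress-ng not installed; skipping benchmark run" >&2
    status=0
fi
```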
Observed regression
-------------------
Comparing pipe IPC throughput across kernels (overclocked, arm_freq=2800):
  Kernel       pipe bogo ops/s    vs. 6.6
  6.6.78             2 487 746       100%
  6.12.75            1 651 427       -34%
  6.18.21            2 049 701       -18%
This regression pattern is consistent across two separate Raspberry Pi 5 units
and has been independently reproduced by the RPi kernel team with 20-run
averages: 6.6 = 2065 Kops/s, 6.12 = 1662, 6.18 = 1805, 7.0 = 1570 (lowest).
Runtime isolation via CONFIG_SCHED_DEBUG
-----------------------------------------
To isolate the root cause, I compiled a custom kernel (rpi-6.12.y,
6.12.81-v8-16k+) with CONFIG_SCHED_DEBUG=y and toggled scheduler features at
runtime via /sys/kernel/debug/sched/features:
  DELAY_DEQUEUE   PREEMPT_SHORT   pipe bogo ops/s   vs. baseline
  on (default)    on (default)         1 506 572         --
  OFF             on                   2 125 473        +41%  <==
  on              OFF                  1 419 026         -6%
  OFF             OFF                  2 078 182        +38%
Disabling DELAY_DEQUEUE alone recovers +41% throughput, almost closing the gap
to 6.6. Disabling PREEMPT_SHORT alone does not help this workload (-6%). The
remaining ~15% gap to 6.6 is likely CONFIG_SCHED_DEBUG=y overhead.
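For anyone reproducing the table above: with CONFIG_SCHED_DEBUG=y, scheduler
features are toggled by writing to the debugfs features file. A sketch
(requires root and a mounted debugfs; the restore step is my addition):

```shell
# Writing NO_<FEATURE> to the features file disables a scheduler feature;
# writing <FEATURE> re-enables it.
FEATURES=/sys/kernel/debug/sched/features
if [ -w "$FEATURES" ]; then
    echo NO_DELAY_DEQUEUE > "$FEATURES"   # disable delayed dequeue
    # ... run the benchmark here ...
    echo DELAY_DEQUEUE > "$FEATURES"      # restore the default
else
    echo "need root and CONFIG_SCHED_DEBUG=y with debugfs mounted" >&2
fi
```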
Root cause analysis
-------------------
The pipe producer-consumer loop is affected by DELAY_DEQUEUE as follows:

Before DELAY_DEQUEUE:
  consumer reads empty pipe -> blocks -> dequeue_task() removes it from the runqueue
  producer writes -> wake_up_interruptible() -> consumer re-enqueued cleanly -> runs

With DELAY_DEQUEUE (v6.12+):
  consumer reads empty pipe -> blocks -> stays on the runqueue (sched_delayed = 1)
  producer writes -> wakeup path must handle the already-queued task ->
  additional bookkeeping per iteration

For a tight 4-worker pipe benchmark running millions of iterations, this
per-iteration overhead compounds directly into the measured throughput.
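If it helps to confirm this, the extra wakeup-path work should be visible as
per-iteration scheduler events. A sketch using perf run alongside the
benchmark (assumes perf is installed; tracepoint access usually needs root,
so the command is guarded):

```shell
# Count context switches and scheduler wakeups system-wide while the pipe
# benchmark runs, to compare per-iteration scheduling overhead across
# kernels or DELAY_DEQUEUE settings.
EVENTS="context-switches,sched:sched_wakeup"
if command -v perf >/dev/null 2>&1; then
    perf stat -e "$EVENTS" -a -- sleep 5 \
        || echo "perf stat failed (tracepoints usually need root)" >&2
else
    echo "perf not installed; skipping" >&2
fi
```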
PREEMPT_SHORT (commit 85e511d) does not contribute to this regression. Its
stated trade-off ("massive_intr workload gets more context switches") does
not appear to be the bottleneck here.
Mitigations tested and ruled out
---------------------------------
- Spectre mitigations: mitigations=off yields only a +0.5-2.5% improvement
  (confirmed by the RPi kernel team). Not the cause.
- CPU governor: tested with both ondemand and performance; no significant
  difference.
References
----------
- Commit 152e11f (DELAY_DEQUEUE):
https://github.com/torvalds/linux/commit/152e11f6df293e816a6a37c69757033cdc72667d
- Commit 85e511d (PREEMPT_SHORT):
https://github.com/torvalds/linux/commit/85e511df3cec46021024176672a748008ed135bf
- RPi issue tracker: https://github.com/raspberrypi/linux/issues/7308
Please let me know if additional data (perf traces, full benchmark logs,
kernel config) would be helpful. I am happy to run further tests on the
hardware.
Thank you for your work on the scheduler.
Best regards,
Thomas Gebhardt (@Kletternaut on GitHub)