Re: sched/fair: DELAY_DEQUEUE causes ~25% pipe IPC regression on Raspberry Pi 5

From: Christian Loehle

Date: Thu Apr 16 2026 - 16:13:49 EST


On 4/16/26 13:22, Tom Gebhardt wrote:
> Hi Peter,
>
> I would like to report a measurable pipe IPC throughput regression introduced by
> commit 152e11f ("sched/fair: Implement delayed dequeue"), first
> present in v6.12-rc1.
>
> This has been independently confirmed on the official Raspberry Pi
> Linux issue tracker
> (raspberrypi/linux #7308), where the RPi kernel team directed the
> issue upstream.
>
>
> Hardware / Software
> -------------------
> - Raspberry Pi 5 Model B, BCM2712 (C1 stepping), 8 GB RAM
> - Raspberry Pi OS Bookworm (arm64)
> - Kernels tested: 6.6.78-v8-16k+ (rpi-6.6.y), 6.12.75+rpt-rpi-2712,
> 6.12.81-v8-16k+ (custom)
> - Benchmark: stress-ng 0.15.06, --pipe 4 --timeout 20s --metrics-brief
>
>
> Observed regression
> -------------------
> Comparing pipe IPC throughput across kernels (overclocked, arm_freq=2800):
>
> Kernel      pipe bogo ops/s    vs. 6.6
> 6.6.78      2 487 746          100%
> 6.12.75     1 651 427          -34%
> 6.18.21     2 049 701          -18%
>
> This regression pattern is consistent across two separate Raspberry Pi 5
> units and has been independently reproduced by the RPi kernel team with
> 20-run averages: 6.6=2065 Kops/s, 6.12=1662, 6.18=1805, 7.0=1570 (lowest).
>

I guess running something closer to mainline isn't an option for you?
There have been a bunch of EEVDF fixes recently, and some may have been
missed in the backports, so I did a quick skim.
Full list of kernel/sched/fair.c changes not on 6.18.y:
059258b0d424 ("sched/fair: Prevent negative lag increase during delayed dequeue")
089d84203ad4 ("sched/fair: Fold the sched_avg update")
0ab25ea2a3b3 ("sched/fair: Simplify task_numa_find_cpu()")
04e49d926f43 ("sched: Enable context analysis for core.c and fair.c")
15257cc2f905 ("sched/fair: Revert force wakeup preemption")
1ae5f5dfe5ad ("sched: Cleanup sched_delayed handling for class switches")
1e900f415c60 ("sched: Detect per-class runqueue changes")
2e4b28c48f88 ("treewide: Update email address")
45e09225085f ("sched/fair: Avoid rq->lock bouncing in sched_balance_newidle()")
4823725d9d1d ("sched/fair: Increase weight bits for avg_vruntime")
50653216e4ff ("sched: Add support to pick functions to take rf")
5324953c06bd ("sched/core: Fix wakeup_preempt's next_class tracking")
553255cc857c ("sched/fair: Fix math notation errors in avg_vruntime comment")
556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
558c18d3fbb6 ("sched/eevdf: Fix HRTICK duration")
55b39b0cf183 ("sched/fair: Use cpumask_weight_and() in sched_balance_find_dst_group()")
5d86d542f68f ("sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead")
5d88e424ec1b ("sched/fair: Make hrtick resched hard")
6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern")
69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
6ab7973f2540 ("sched/fair: Fix sched_avg fold")
6b67c8a72e56 ("sched/fair: Move checking for nohz cpus after time check")
71fedc41c23b ("sched/fair: Switch to rcu_dereference_all()")
76504bce4ee6 ("sched/fair: Get this cpu once in find_new_ilb()")
78cde54ea5f0 ("sched/eevdf: Clear buddies for preempt_short")
82d6e01a0699 ("sched/fair: Only update stats for allowed CPUs when looking for dst group")
8d16e3c6f844 ("sched/fair: Fix comma operator misuse in NUMA fault accounting")
926475806606 ("sched/fair: Update overutilized detection")
94e70734b4d0 ("sched/fair: Change likelyhood of nohz.nr_cpus")
95a0155224a6 ("sched/fair: Limit hrtick work")
97015376642f ("sched/fair: Simplify hrtick_update()")
9fe89f022c05 ("sched/fair: More complex proportional newidle balance")
bf4afc53b77a ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
d3d663faa1d4 ("sched/fair: Filter false overloaded_group case for EAS")
db4551e2ba34 ("sched/fair: Use full weight to __calc_delta()")
dec9554dc036 ("sched: Move attach_one_task and attach_task helpers to sched.h")
e636ffb9e31b ("sched/deadline: Fix dl_server time accounting")
e837456fdca8 ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals")
f1320a8dd8ba ("sched/fair: Simplify the entry condition for update_idle_cpu_scan()")
f24165bfa7ef ("sched/headers: Rename rcu_dereference_check_sched_domain() => rcu_dereference_sched_domain()")
fa6874dfeee0 ("sched/fair: Remove superfluous rcu_read_lock() in the wakeup path")
fd54d81c2c0e ("sched/fair: Skip SCHED_IDLE rq for SCHED_IDLE task")
fe7171d0d5df ("sched/fair: Simplify SIS_UTIL handling in select_idle_cpu()")
ff1de90dd7a6 ("sched/fair: Drop useless cpumask_empty() in find_energy_efficient_cpu()")
101f3498b4bd ("sched/fair: Revert 6d71a9c61604 (sched/fair: Fix EEVDF entity placement bug causing scheduling lag)")

We will probably want to backport at least:
059258b0d424 ("sched/fair: Prevent negative lag increase during delayed dequeue")
556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()")
101f3498b4bd ("sched/fair: Revert 6d71a9c61604 (sched/fair: Fix EEVDF entity placement bug causing scheduling lag)")
e837456fdca8 ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals")
78cde54ea5f0 ("sched/eevdf: Clear buddies for preempt_short")
558c18d3fbb6 ("sched/eevdf: Fix HRTICK duration")

Could you try at least the first three of those on your setup?

>
> Runtime isolation via CONFIG_SCHED_DEBUG
> -----------------------------------------
> To isolate the root cause, I compiled a custom kernel (rpi-6.12.y,
> 6.12.81-v8-16k+) with CONFIG_SCHED_DEBUG=y and toggled scheduler
> features at runtime:
>
> DELAY_DEQUEUE   PREEMPT_SHORT   pipe bogo ops/s   vs. baseline
> on (default)    on (default)    1 506 572         --
> OFF             on              2 125 473         +41%  <==
> on              OFF             1 419 026         -6%
> OFF             OFF             2 078 182         +38%
>
> Disabling DELAY_DEQUEUE alone recovers +41% throughput, almost closing
> the gap to 6.6.
> Disabling PREEMPT_SHORT alone has no positive effect on this workload.
>
> The remaining gap to 6.6 (~15%) is likely CONFIG_SCHED_DEBUG=y overhead.

That seems very high; I don't think CONFIG_SCHED_DEBUG=y has measurable
overhead. You can also just change a feature's default in the source to
test it with CONFIG_SCHED_DEBUG=n.
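FWIW the runtime toggle is just a write to the features file (path as of
v6.12; a guarded sketch, feature names from kernel/sched/features.h):

```shell
# Toggle DELAY_DEQUEUE at runtime (needs root and debugfs mounted).
# Path as of v6.12; older kernels expose /sys/kernel/debug/sched_features.
F=/sys/kernel/debug/sched/features
if [ -w "$F" ]; then
    grep -o 'NO_DELAY_DEQUEUE\|DELAY_DEQUEUE' "$F" | head -1   # current state
    echo NO_DELAY_DEQUEUE > "$F"    # disable delayed dequeue
    # ... run the benchmark here ...
    echo DELAY_DEQUEUE > "$F"       # restore the default
else
    echo "features file not writable (need root + CONFIG_SCHED_DEBUG=y)"
fi
```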

>
>
> Root cause analysis
> -------------------
> The pipe producer-consumer loop is affected by DELAY_DEQUEUE as follows:
>
> Before DELAY_DEQUEUE:
> consumer reads empty pipe -> blocks -> dequeue_task() removes it from runqueue
> producer writes -> wake_up_interruptible() -> consumer re-enqueued
> cleanly -> runs
>
> With DELAY_DEQUEUE (v6.12+):
> consumer reads empty pipe -> blocks -> stays on runqueue (sched_delayed = 1)
> producer writes -> wakeup path handles the already-queued task ->
> additional bookkeeping per iteration
>
> For a tight 4-worker pipe benchmark running millions of iterations, this
> per-iteration overhead compounds directly into the measured throughput.
>
> PREEMPT_SHORT (commit 85e511d) does not contribute to this regression.
> Its stated trade-off ("massive_intr workload gets more context switches")
> does not appear to be the bottleneck here.
>
>
> Mitigations tested and ruled out
> ---------------------------------
> - Spectre mitigations: mitigations=off yields only a +0.5-2.5%
>   improvement (confirmed by the RPi kernel team). Not the cause.
> - CPU governor: tested with both ondemand and performance. No
>   significant difference.
>
>
> References
> ----------
> - Commit 152e11f (DELAY_DEQUEUE):
> https://github.com/torvalds/linux/commit/152e11f6df293e816a6a37c69757033cdc72667d
> - Commit 85e511d (PREEMPT_SHORT):
> https://github.com/torvalds/linux/commit/85e511df3cec46021024176672a748008ed135bf

Generally there's no need to provide links if you have the hash, but we use
the 12-character short form:
152e11f6df29 ("sched/fair: Implement delayed dequeue")
781773e3b680 ("sched/fair: Implement ENQUEUE_DELAYED")
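
i.e., from a kernel tree:

```shell
# Cite a commit in the kernel's preferred style:
# 12-character abbreviated hash plus the quoted subject line.
git show -s --abbrev=12 --pretty='format:%h ("%s")%n' 152e11f6df29
# -> 152e11f6df29 ("sched/fair: Implement delayed dequeue")
```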


> - RPi issue tracker: https://github.com/raspberrypi/linux/issues/7308
>
> Please let me know if additional data (perf traces, full benchmark logs,
> kernel config) would be helpful. I am happy to run further tests on the
> hardware.
>
> Thank you for your work on the scheduler.
>
> Best regards,
> Thomas Gebhardt (@Kletternaut on GitHub)
>