Re: [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time

From: Qais Yousef

Date: Fri May 15 2026 - 23:01:46 EST

On 05/15/26 10:24, Tom Gebhardt wrote:
> Hi Qais,
>
> Thanks for the follow-up. Here are the patch isolation results and answers to your questions.
>
> Regarding the governor:
>
> Yes, I'm running `ondemand`, not `schedutil`. My mistake for not mentioning that upfront - I
> assumed the improvement was due to the util_est path being triggered regardless of the governor.
> The improvement is clearly measurable even with `ondemand`, which is surprising given that your
> patches specifically target `schedutil`.
>
> Patch isolation -- 12/13 only vs. both:
>
> I re-ran the benchmarks with patch 13/13 (`sched/pelt: Always allow load updates`) reverted,
> keeping only patch 12/13 (`sched/fair: Call update_util_est() after dequeue_entities()`).
>
> Results using stress-ng 0.15.06 pipe stressor (4 workers, 20s):
>
> Kernel                           Clock     pipe bogo ops/s  delta vs. 6.6
> -------------------------------  --------  ---------------  -------------
> 6.6.78-v8-16k+                   2400 MHz        2 129 330  +/-0% (ref)
> 6.6.78-v8-16k+                   2800 MHz        2 487 746  +/-0% (ref)
> 7.0.0-v8-16k+ stock              2400 MHz        1 694 011  -20.5%
> 7.0.0-v8-16k+ stock              2800 MHz        1 851 567  -25.6%
> 7.0.0 + ttwu only (10 patches)   2400 MHz        1 836 006  -13.8%
> 7.0.0 + ttwu only (10 patches)   2800 MHz        1 934 076  -22.3%
> 7.0.0 + ttwu + patch 12/13 only  2400 MHz        2 054 879  -3.5%
> 7.0.0 + ttwu + patch 12/13 only  2800 MHz        2 415 617  -2.9%
> 7.0.0 + ttwu + patches 12+13     2400 MHz        1 996 002  -6.3%
> 7.0.0 + ttwu + patches 12+13     2800 MHz        2 342 144  -5.9%
>
> The key finding: patch 12/13 alone outperforms the combined set on ARM. Adding patch 13/13
> actually hurts performance slightly -- about 3 percentage points -- at both clock speeds. This
> suggests that `sched/pelt: Always allow load updates` has a negative interaction on ARM/Cortex-A76,
> possibly related to how PELT decay is handled without `schedutil` active, or an ARM-specific
> DELAY_DEQUEUE interaction.
>
> Patch 12/13 alone closes the gap to just -2.9% vs. 6.6 at 2800 MHz (OC), and -3.5% at nominal
> 2400 MHz. That is a remarkable recovery from the -31.9% regression in 7.0 stock.
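
For anyone re-checking the delta column, it can be re-derived from the raw bogo
ops/s figures in the table above. This is only a minimal sketch: it assumes each
delta is the relative change against the 6.6.78 run at the same clock, and the
names (REF, RUNS, delta_pct) are made up for illustration.

```python
# Raw pipe bogo ops/s copied from the table above.
# Assumption: each delta is computed against the 6.6.78 run at the same clock.
REF = {2400: 2_129_330, 2800: 2_487_746}  # 6.6.78-v8-16k+ reference runs

RUNS = {
    ("7.0.0 stock", 2400): 1_694_011,
    ("7.0.0 stock", 2800): 1_851_567,
    ("7.0.0 + ttwu only", 2400): 1_836_006,
    ("7.0.0 + ttwu only", 2800): 1_934_076,
    ("7.0.0 + ttwu + 12/13 only", 2400): 2_054_879,
    ("7.0.0 + ttwu + 12/13 only", 2800): 2_415_617,
    ("7.0.0 + ttwu + 12+13", 2400): 1_996_002,
    ("7.0.0 + ttwu + 12+13", 2800): 2_342_144,
}


def delta_pct(ops, ref_ops):
    """Relative change versus the reference run, in percent."""
    return (ops / ref_ops - 1.0) * 100.0


# Print one line per run with its delta against the matching reference clock.
for (kernel, mhz), ops in RUNS.items():
    print(f"{kernel:<28} {mhz} MHz  {delta_pct(ops, REF[mhz]):+.1f}%")
```

The gap between "12/13 only" and "12+13" is then the difference of two such
deltas at the same clock, i.e. a figure in percentage points.
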
>
> Regarding Perfetto traces:
>
> Unfortunately I cannot provide sched-analyzer traces at this time -- the kernel is not compiled
> with CONFIG_DEBUG_INFO_BTF=y (pahole/dwarves not available in this build environment), which
> is required for BPF CO-RE. I can try to arrange that for a future run if it would still be useful.

You don't need to have it enabled in the kernel. I don't need util info, and by
default, if you don't pass any args, it should not cause BPF to be loaded. Note
there are binaries on the release page on github, so you don't have to compile
it. You can also use the regular perfetto command to record a trace and
visualize it, and sched-analyzer-pp will be able to analyze it.

I am looking to see if the task placement and the running/runnable time pattern
have changed significantly enough to cause the big difference.

It'd be good to profile it with perf too. You might be hitting some weird
contention that the patch just happens to hide.

>
> Device: Raspberry Pi 5 (8 GB, C1-stepping), Bookworm arm64, kernel rpi-7.0.y.
> Background: https://github.com/raspberrypi/linux/issues/7308
>
> Tom