Re: [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time

From: Qais Yousef

Date: Sat May 30 2026 - 22:36:02 EST


On 05/29/26 09:53, Tom Gebhardt wrote:
> On 05/29/26 02:43, Qais Yousef wrote:
> > 7.0+ttwu+vincent is the best, right?
>
> Yes.
>
> > Have you verified your actual workload is seeing benefit? I think when
> > I scanned the github bug you referenced, the original report was observing
> > a regression in some real setup, not these stress-ng tests.
>
> I am the reporter. The original issue (#7308) was observed as a drop in

Yes, I realized, thanks for taking the time to chase all of this :)

> camera frame rate when running two parallel IMX477 streams via libcamera on
> RPi5 under kernel 6.12+. The camera pipeline is pipe-IPC-heavy (GStreamer /
> libcamera internal queues), so the regression surfaced there first in a real
> workload. To isolate the cause I moved to synthetic pipe benchmarks
> (stress-ng), which confirmed and quantified the regression cleanly.
> A Raspberry Pi developer (popcornmix) also posted IPC benchmark results on
> the issue, independently confirming the trend across kernel versions
> (6.6=2065 > 6.18=1805 > 6.12=1662 > 7.0=1570 Kops/s).

I was wondering if the real workload is as sensitive.

>
> The stress-ng pipe stressor is therefore not an artificial worst-case -- it
> directly exercises the code path that causes the real-world camera regression.
> That said, I agree stress-ng amplifies the effect, and I cannot give you an
> exact frame-rate number yet with the ttwu+vincent patches applied.

No worries, don't want to ask you to do more work ;-)

>
> > IPC drops 14% on 7.0 stock. Due to stalling you reckon?
>
> Yes. The branch misprediction rate explains most of it. On Cortex-A76 a
> branch mispredict costs ~13 cycles. Normalised by instruction count:
>
>   Kernel               branch-miss rate   vs 6.6
>   ------------------   ----------------   ------
>   6.6.78               0.178%             ref
>   7.0.0 stock          0.427%             +140%
>   7.0.0+ttwu+vincent   0.271%             +52%

This could potentially be due to the higher number of context switches.

>
> The raw counts I reported yesterday were misleading because the instruction
> counts differ between kernels (different amounts of useful work). Apologies
> for not normalising upfront. The rate tells a cleaner story: stock EEVDF
> causes 2.4× more mispredictions per instruction than CFS, and ttwu+vincent
> brings that down to 1.5× -- significant improvement but not full recovery.

Note 6.6 kernels are EEVDF too.
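
For anyone following along, the "vs 6.6" column and the 2.4x/1.5x ratios quoted
above follow from dividing each normalised rate by the 6.6.78 reference. A quick
sketch with awk (rel_change is just a name made up for illustration, not part
of perf):

```shell
# Relative change of a branch-miss rate against a reference rate.
# Both arguments are percentages taken from the table quoted above.
# rel_change is a hypothetical helper name, not an existing tool.
rel_change() {
    awk -v rate="$1" -v ref="$2" \
        'BEGIN { printf "%.1fx %+.0f%%\n", rate / ref, (rate / ref - 1) * 100 }'
}

rel_change 0.427 0.178   # 7.0.0 stock vs 6.6.78
rel_change 0.271 0.178   # 7.0.0+ttwu+vincent vs 6.6.78
```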

>
> > Do you have the full output? It would be interesting to use perf diff.
>
> A proper perf diff with resolved kernel symbols requires running against the
> matching kernel. I ran `perf report --no-children -s symbol` on each .data
> file while booted on the corresponding kernel. Key findings:
>
> 7.0.0 stock (flat, self-overhead):
>
>   12.98%  finish_task_switch.isra.0
>             -> __schedule -> schedule
>                  -> anon_pipe_read   5.72%
>                  -> anon_pipe_write  1.38%
>
> 7.0.0+ttwu+vincent (flat, self-overhead):
>
>   19.62%  finish_task_switch.isra.0
>             -> __schedule -> schedule
>                  -> anon_pipe_read   8.22%
>                  -> anon_pipe_write  4.34%
>
> The striking difference is in the pipe_write -> schedule() path: 1.38% on
> stock vs 4.34% with ttwu+vincent. The ttwu patches make pipe writers yield
> the CPU far more aggressively after each write, allowing the reader to run
> immediately. Stock EEVDF leaves this to the scheduler's own timing, which
> results in more latency and lower throughput.

I collected a trace for stress-ng --pipe 2 on a 2 CPU system (6.8 kernel) and
I can see it ends up with 4 tasks, 2 almost always running and 2 that sleep and
wake up, rather rapidly.

stress-ng 1: 58% RUNNING, 41% RUNNABLE, ~1% sleeping
stress-ng 2: 41.5% RUNNING, 58.5% RUNNABLE, ~1% sleeping
stress-ng 3: 40.1% RUNNING, 22.2% RUNNABLE, ~37.6% sleeping
stress-ng 4: 59.9% RUNNING, 21.7% RUNNABLE, ~18.5% sleeping

The avg RUNNING time of these tasks is a few 10s of us and the min is 100s of ns.

It seems the tasks are pinned too, 2 per cpu.

I hope your real workload doesn't behave this way, this is very inefficient :)

>
> The higher absolute percentage in finish_task_switch for vincent is expected:
> vincent completes ~24% more pipe operations in the same wall time, so there
> are proportionally more context switches completing.
>
> On 6.6 (from the call-graph profile recorded separately), finish_task_switch
> is not visible as a top-level hotspot at all -- consistent with CFS handling
> this path much more efficiently.

It could also be about the wakeup preemption pattern. The pattern I see is that
one task wakes up and runs for a bit, then the other task wakes up 4 times in
rapid succession. The first 3 times it preempts after ~0.5us, but the last time
it waits behind the original task until that task sleeps, which takes ~9us.

If I do

echo NO_WAKEUP_PREEMPTION | sudo tee /sys/kernel/debug/sched/features

I can see the bogo ops/s jump by 16%.
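
To double-check that the toggle took effect, something along these lines should
work. It is only a sketch: it assumes debugfs is mounted at /sys/kernel/debug
and that the scheduler features file lives under sched/, as on recent kernels.
feature_set is a hypothetical helper name.

```shell
# Assumed location of the scheduler features file; override FEATURES
# if debugfs is mounted elsewhere on your system.
FEATURES=${FEATURES:-/sys/kernel/debug/sched/features}

# Succeed if the exact token (e.g. NO_WAKEUP_PREEMPTION) is listed in
# the features file. feature_set is a hypothetical helper, not a tool.
feature_set() {
    tr -s ' ' '\n' < "$FEATURES" | grep -qx "$1"
}
```

Run as root, `feature_set NO_WAKEUP_PREEMPTION && echo disabled` reports the
state, and writing WAKEUP_PREEMPTION back to the same file restores the default.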

The two tasks now interleave equally, and the tasks that had ~1% sleeping time
now go up to 7% and 10% sleeping time.

You can achieve the same outcome by running as SCHED_BATCH:

chrt -b 0 stress-ng --pipe 4 --timeout 20s --metrics-brief

>
> > Maybe there's higher rq lock contention. But this finish_task_switch and
> > __raw_spin_unlock_irqrestore are common to see, especially when there's
> > high context switch rate.
>
> Agreed -- I cannot rule out rq lock contention without perf diff with
> matched build-IDs. The pattern I see (finish_task_switch dominant, driven
> by pipe_read/write) is consistent with high context switch rate rather than
> a pathological lock. But your point about a 'hot variable' like rq->clock
> is noted -- I cannot confirm or deny that from flat profiles alone.
>
> > I hope perfetto trace will help visualize the pattern that led to this
> > higher context switching.
>
> I will work on getting a perfetto trace. Expecting to have that in a
> follow-up.
>
> Tom