Re: [PATCH v3 0/7] sched: Flatten the pick

From: Shubhang Kaushik

Date: Thu Jun 11 2026 - 22:30:09 EST

Hello Peter,

I applied the `sched/flat` patchset from your tree on top of the
`tip/sched/core` base commit 9ebe5c3c29f62 (7.1-rc2).

The evaluation was performed on an 80-core Ampere Altra system
running Fedora Linux 41.

Benchmark runs:

1. Hackbench (execution time in seconds; lower is better)
The data shows a clear pivot point at 4 tasks:
- Low concurrency (< 4 tasks): execution time regresses by 1.8% to 4.0%.
  Removing cgroup isolation boundaries expands the idle CPU search,
  adding slight overhead to the wake-up path.
    * 1 Thread: (+1.8%)
    * 2 Threads: (+4.0%)
    * 2 Procs: (+3.3%)
- Tipping point (4 tasks): performance is essentially flat.
    * 4 Threads: (+0.03%)
    * 4 Procs: (+0.1%)
- High concurrency (>= 8 tasks): execution time improves by 0.7% to 2.3%.
  Collapsing the tree structure down to a flat layout removes
  multi-layer load tracking updates (update_load_avg), saving cycles
  under load.
    * 8 Threads: (-0.7%)
    * 16 Threads: (-1.8%)
    * 8 Procs: (-1.2%)
    * 16 Procs: (-2.3%)
    * 32 Procs: (-1.6%)

2. Schbench (wakeup tail latency in us; lower is better)
- 16 Threads (128kb footprint): 99.9th percentile tail latency drops
  by 12.21%. Operating on a unified runqueue layer prevents induced
  group-level throttling.
- 32 Threads (128kb footprint): 99.9th percentile tail latency
  regresses by 5.50%. Eliminating nested queues increases lock
  contention during heavy simultaneous wakeups.

3. Sysbench
- Sysbench RAM: Throughput increases by +1.55% (MiB/sec). Fewer tree
traversals reduce cache-line bouncing, freeing up cycles.

The patchset trades minor low-load performance for better scaling and
tighter tail latencies under distributed load. However, the majority of
these deltas remain small and sit near the measurement noise floor (<=
4%).
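
As a rough cross-check of that summary, the reported deltas can be compared
against the 4% band with a few lines of Python. This is only a sketch: the
names (REPORTED_DELTAS, NOISE_FLOOR_PCT, pct_change, outside_noise) are
illustrative, and treating 4% as the noise floor is an assumption taken from
the figure quoted above, not a measured run-to-run variance.

```python
# Relative deltas in percent, copied from the results reported above.
# For hackbench and schbench a negative delta is an improvement;
# for sysbench (throughput) a positive delta is an improvement.
REPORTED_DELTAS = {
    "hackbench 1 thread": 1.8,
    "hackbench 2 threads": 4.0,
    "hackbench 2 procs": 3.3,
    "hackbench 4 threads": 0.03,
    "hackbench 4 procs": 0.1,
    "hackbench 8 threads": -0.7,
    "hackbench 16 threads": -1.8,
    "hackbench 8 procs": -1.2,
    "hackbench 16 procs": -2.3,
    "hackbench 32 procs": -1.6,
    "schbench 16 threads p99.9": -12.21,
    "schbench 32 threads p99.9": 5.50,
    "sysbench RAM throughput": 1.55,
}

# Assumption: the "<= 4%" band from the summary is used as the noise floor.
NOISE_FLOOR_PCT = 4.0


def pct_change(baseline, patched):
    """Relative change of `patched` versus `baseline`, in percent."""
    return (patched - baseline) / baseline * 100.0


def outside_noise(deltas, floor=NOISE_FLOOR_PCT):
    """Return only the entries whose magnitude exceeds the noise floor."""
    return {name: d for name, d in deltas.items() if abs(d) > floor}


if __name__ == "__main__":
    for name, delta in outside_noise(REPORTED_DELTAS).items():
        print(f"{name}: {delta:+.2f}%")
```

With the numbers above, only the two schbench results fall outside the band;
every hackbench and sysbench delta sits at or below 4%.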

Regards,
Shubhang Kaushik

On Fri, 5 Jun 2026, Peter Zijlstra wrote:

Hi!

New version, same story [1]. TL;DR:

- Adds new cgroup_mode knob and implements new policies to address the
hierarchy level weight mismatch.

- Builds upon that base to create a flat / single runqueue scheduler where the
cgroup hierarchy is expressed through dynamic weight management.

I'm hoping to be able to merge these patches early in the next cycle (after
7.2-rc1).

Random benchmark:

Game vs 'for ((i=0; i<8; i++)) do nice ./spin.sh; done':

Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
Intel Core i7-2600K
AMD Radeon RX 580

Shadows Awakening (GOG)

              default   slice[*]

FPS   min       4.0      29.0
      avg      47.5      59.2
      max      83.7      83.7

FT    min       9.3      10.2
      avg      34.0      17.0
      max     121.2      30.0

FPS (Frames Per Second)
FT (FrameTime)

[*] Command prefix: 'chrt -o --sched-runtime 100000 0'

Changes since v2:

- merged debug and prep patches
- fixed update_entity_lag() on dequeue (Vincent)
- fixed throttle vs tick (Prateek)
- fixed wakeup_preempt_fair()
- rebased on tip/sched/core
- rewritten cgroup_mode changelogs
- reworked cgroup_mode concur
- added cgroup_mode tasks
- changed default cgroup_mode

[1] - https://lore.kernel.org/r/20260511113104.563854162@xxxxxxxxxxxxx

Can also be had:

git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat

 include/linux/cpuset.h |    6
 include/linux/sched.h  |    1
 kernel/cgroup/cpuset.c |   15
 kernel/sched/core.c    |    5
 kernel/sched/debug.c   |   89 ++++
 kernel/sched/fair.c    |  943 ++++++++++++++++++++++++-------------------------
 kernel/sched/pelt.c    |    6
 kernel/sched/sched.h   |   30 -
 8 files changed, 607 insertions(+), 488 deletions(-)