Re: [PATCH v2 00/10] sched: Flatten the pick

From: Vincent Guittot

Date: Tue May 12 2026 - 04:43:52 EST

On Mon, 11 May 2026 at 14:07, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> Hi!
>
> So cgroup scheduling has always been a pain in the arse. The problems start
> with weight distribution and end with hierarchical picks and it all sucks.
>
> The problems with weight distribution are related to that infernal global
> fraction:
>
>              tg->w * grq_i->w
>   ge_i->w = ------------------
>              \Sum_j grq_j->w
>
> which we've approximated reasonably well by now. However, the immediate
> consequence of this fraction is that the total group weight (tg->w) gets
> fragmented across all your CPUs. And at 64 CPUs that means your per-cpu cgroup
> weight ends up being a nice 19 task worth. And more CPUs more tiny. Combine
> with the fact that 256 CPU systems are relatively common these days, this
> becomes painful.
>
> The common 'solution' is to inflate the group weight by 'nr_cpus'; the
> immediate problem with that is that when all load of a group gets concentrated
> on a single CPU, the per-cpu cgroup weight becomes insanely large, easily
> exceeding nice -20.
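
For anyone who wants to check the two numbers above, here is a rough
sketch of the fraction in plain arithmetic. It assumes a default group
weight of 1024 (the nice 0 weight) and ignores the kernel's
approximations and fixed-point scaling; the function name is made up for
illustration and is not taken from the patches.

```python
# Rough model of ge_i->w = tg->w * grq_i->w / Sum_j grq_j->w.
# Plain arithmetic only; the kernel's approximations and fixed-point
# scaling are deliberately left out.

NICE_0_WEIGHT = 1024          # sched_prio_to_weight[] entry for nice 0
NICE_19_WEIGHT = 15           # ... for nice 19
NICE_MINUS_20_WEIGHT = 88761  # ... for nice -20


def group_entity_weights(tg_weight, grq_weights):
    """Split tg_weight across CPUs in proportion to each group runqueue weight.

    Hypothetical helper, not a kernel function.
    """
    total = sum(grq_weights)
    if total == 0:
        return [0.0] * len(grq_weights)
    return [tg_weight * w / total for w in grq_weights]


# Case 1: default group weight, one nice 0 task on each of 64 CPUs.
spread = group_entity_weights(NICE_0_WEIGHT, [NICE_0_WEIGHT] * 64)
print(spread[0], NICE_19_WEIGHT)

# Case 2: group weight inflated by nr_cpus = 256, all load on CPU 0.
concentrated = group_entity_weights(NICE_0_WEIGHT * 256,
                                    [NICE_0_WEIGHT] + [0] * 255)
print(concentrated[0], NICE_MINUS_20_WEIGHT)
```

In the first case each CPU ends up with a weight of 16, right next to the
nice 19 weight of 15. In the second case the single busy CPU carries the
whole inflated weight, well above the nice -20 weight.
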
>
> Additionally there are numerical limits on the max weight you can have before
> the math starts suffering overflows. As such there is a definite limit on the
> total group weight. Which has annoyed people ;-)
>
> The first few patches add a knob /debug/sched/cgroup_mode and a few different
> options on how to deal with this. My favourite is 'concur', but obviously that
> is also the most expensive one :-/ It adds a tg->tasks counter which makes the
> update_tg_load_avg() thing more expensive.
>
> I have some ideas but I figured I ought to share these things before sinking
> more time into it.
>
>
> On to the hierarchical pick; this has been causing trouble for a very long
> time. So once again an attempt at flattening it. The basic idea is to keep the
> full hierarchical load tracking as-is, but keep all the runnable entities in a
> single level. The immediate consequence of all this is of course that we need to
> constantly re-compute the effective weight of each entity as things progress.
>
> Reweight is done on:
> - enqueue
> - pick -- or rather set_next_entity(.first=true)
> - tick
>
> So while the {en,de}queue operations are still O(depth) due to the full
> accounting mess, the pick is now a single level. Removing the intermediate
> levels that obscure runnability etc.
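
To make the "effective weight" idea concrete, here is a minimal sketch of
one way to fold a task's weight through the hierarchy into a single
level. This is my own illustration under simple assumptions (each level
scales the weight by the group entity's share of its runqueue); it is not
the code from the series, and the name flat_weight() is hypothetical.

```python
def flat_weight(task_weight, levels):
    """Fold a task's weight through its ancestor groups into one flat weight.

    levels is a list of (group_entity_weight, group_rq_total_weight) pairs,
    ordered from the task's own group up towards the root. At each level the
    weight is scaled by the fraction of the group runqueue it represents.
    Hypothetical helper, not the patchset's implementation.
    """
    w = float(task_weight)
    for ge_weight, rq_weight in levels:
        if rq_weight <= 0:
            raise ValueError("a runqueue holding a runnable entity has weight > 0")
        w = w * ge_weight / rq_weight
    return w


# Two nice 0 tasks share a group whose entity weight on this CPU is 1024:
# each task competes at the root with half of the group's weight.
print(flat_weight(1024, [(1024, 2048)]))

# Same task, but the group is itself nested in a parent group that gets
# half of its own runqueue.
print(flat_weight(1024, [(1024, 2048), (512, 1024)]))
```

The loop is O(depth), and the result changes whenever any runqueue weight
along the path changes, which is why the reweight has to be repeated as
things progress.
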
>
>
> For testing, I've done a little experiment, I dug out what is colloquially known
> as a potato. A trusty old Sandybridge 2600K with a RX 580, and ran a game on
> it. From GOG, I had available 'Shadows: Awakening', a fun title that normally
> runs really well on this machine (provided you stick to 1080p).
>
> To make it interesting, I added 8 (one for each logical CPU) copies of: 'nice
> spin.sh'; this results in the game becoming almost unplayable, as in proper
> terrible.
>
> I used MangoHUD to record a few minutes of playtime for statistics, and then
> quit the game and re-started it with a shorter slice set (base/10). This
> results in the game being entirely playable -- not great, but definitely
> playable.
>
> Lutris / GE-Proton10-34 / Steam Runtime 3 (sniper)
> Intel Core i7-2600K
> AMD Radeon RX 580
>
> Shadows Awakening (GOG)
>
>                 default   slice(*)
>
> FPS     min         3.8       20.6
>         avg        48.0       57.2
>         max        87.4       80.3
>
> FT      min         9.4        8.4
>         avg        34.5       19.5
>         max       107.4       37.2
>
> FPS (Frames Per Second)
> FT (FrameTime)
>
> [*] Command prefix: 'chrt -o --sched-runtime 280000 0'
> effectively setting 'base_slice_ns/10'
>
> I have not compared to a kernel without flat on, just wanted to run non-trivial
> workloads and play with slice to make sure everything 'works'.

I haven't reviewed the patches yet but I ran some tests with them while
testing sched latency related changes for short slice wakeup
preemption. I see some large hackbench regressions with this series
on HMP systems, with and without EAS. Those figures are unexpected
because the benchmarks run on the root cfs.

One example with hackbench 8 groups thread pipe:

         tip/sched/core   tip/sched/core        +this patchset         +this patchset
slice    2.8ms            16ms                  2.8ms                  16ms

dragonboard rb5 with EAS
         0.748(+/-4.6%)   0.621(+/-3.6%) +17%   1.915(+/-7.9%) -156%   0.689(+/-9.1%) +8%

radxa orion6 HMP without EAS
         0.588(+/-5.8%)   0.677(+/-5.9%) -15%   1.505(+/-10%) -156%    1.071(+/-5.9%) -82%

Increasing the slice partly removes the regressions but this is surprising
because the bench runs at root cfs and I thought that the results would not
change in such a case.
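
In case it helps when reading the percentages: they appear to be relative
to tip/sched/core at the 2.8ms slice on the same board, with a negative
value meaning a longer hackbench run. The sketch below is only my reading
of the numbers, assuming that baseline; change_pct() is a throwaway
helper, not part of any tool.

```python
def change_pct(baseline, value):
    """Relative change versus baseline, in percent, rounded to an integer.

    hackbench reports a run time, so lower is better: a positive result
    means faster than the baseline, a negative one means slower.
    """
    return round((baseline - value) / baseline * 100)


# dragonboard rb5 with EAS, baseline = tip/sched/core at 2.8ms
rb5 = [change_pct(0.748, v) for v in (0.621, 1.915, 0.689)]
print(rb5)

# radxa orion6 HMP without EAS, baseline = tip/sched/core at 2.8ms
orion6 = [change_pct(0.588, v) for v in (0.677, 1.505, 1.071)]
print(orion6)
```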

I will review the patchset and try to find out what is going wrong.

>
>
> Can also be had:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/flat
>
>  include/linux/cpuset.h |    6
>  include/linux/sched.h  |    1
>  kernel/cgroup/cpuset.c |   15
>  kernel/sched/core.c    |   47 --
>  kernel/sched/debug.c   |  171 +++++---
>  kernel/sched/fair.c    | 1038 ++++++++++++++++++++++---------------------------
>  kernel/sched/pelt.c    |    6
>  kernel/sched/sched.h   |   44 --
>  8 files changed, 672 insertions(+), 656 deletions(-)
>
> ---
> Change since v1 ( https://patch.msgid.link/20260317095113.387450089@xxxxxxxxxxxxx ):
> - various Sashiko thingies
> - rebase atop current -tip
>
>