Re: [PATCH 4/4] sched/fair: Proportional newidle balance

From: Mario Roy

Date: Thu Jan 29 2026 - 20:45:03 EST


Peter, thank you for your fix to improve EEVDF.

Cc'd Andrea Righi
Thank you for the is_idle_core() function and help. [0]

Cc'd Shubhang Kaushik
Your patch inspired me to do some trial-and-error testing, which has
since become the 0280 patch in the CachyMod GitHub repo. [0]

Together with the help of CachyOS community members, we concluded
that prefcore + prefer-idle-core is surreal. I enjoy the EEVDF
scheduler a lot more now that it favors the SMT siblings less.

For comparison, I added results for sched-ext cosmos.

Limited CPU saturation can reveal potential scheduler issues.
Testing includes 100%, 50%, 31.25%, and 25% CPU saturation.
All kernels were built with GCC to factor out Clang/AutoFDO.
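For reference, the saturation percentages are simply the thread count
over the machine's 48 hardware threads (the 24-core/48-thread
Threadripper 9960X mentioned below); a trivial sketch:

```python
# Saturation = N threads / 48 hardware threads (24c/48t Threadripper 9960X).
def saturation(nthreads: int, hw_threads: int = 48) -> float:
    return 100.0 * nthreads / hw_threads

for n in (48, 24, 15, 12):
    print(f"{n}cpus -> {saturation(n)}%")   # 100.0, 50.0, 31.25, 25.0
```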

A) 6.18.8-rc1
   with sched/fair: Proportional newidle balance

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.462s      14.181s        20.311s        24.498s
   darktable  [2]       2.811s       3.715s         5.315s         6.434s
   easywave   [3]      19.747s      10.804s        20.207s        21.571s
   stress-ng  [4]     37632.06     56220.21       41694.50       34740.58

B) 6.18.8-rc1
   Peter Z's fix for sched/fair: Proportional newidle balance

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.340s      14.733s        21.339s        25.069s
   darktable  [2]       2.493s       3.616s         5.148s         5.968s
   easywave   [3]      11.357s      13.312s *      18.483s        20.741s
   stress-ng  [4]     37533.24     55419.85       39452.17       32217.55

   algorithm3 and stress-ng regressed, possibly a limited-CPU-saturation anomaly
   easywave (*) weird result; repeatable, but all over the place

C) 6.18.8-rc1
   Revert sched/fair: Proportional newidle balance

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.286s      15.101s        21.417s        25.126s
   darktable  [2]       2.484s       3.531s         5.185s         6.002s
   easywave   [3]      11.517s      12.300s        18.466s        20.428s
   stress-ng  [4]     42231.92     47306.18 *     32438.03 *     28820.83 *

   stress-ng (*) lackluster with limited CPU saturation

D) 6.18.8-rc1
   Revert sched/fair: Proportional newidle balance
   Plus apply the prefer-idle-core patch [0]

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.312s      11.292s        17.243s        21.811s
   darktable  [2]       2.418s       3.711s *       5.499s *       6.510s *
   easywave   [3]      10.035s       9.832s        15.738s        18.805s
   stress-ng  [4]     44837.41     63364.56       55646.26       48202.58

   darktable (*) lower performance with limited CPU saturation
   noticeably better performance otherwise

E) scx_cosmos -m 0-5 -s 800 -l 8000 -f -c 1 -p 0 [5]

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.218s      11.188s        17.045s        21.130s
   darktable  [2]       2.365s       3.900s         4.626s         5.664s
   easywave   [3]       9.187s      16.528s *      15.933s        16.991s
   stress-ng  [4]     21065.70     36417.65       27185.95       23141.87

   easywave (*) sched-ext cosmos appears to favor SMT siblings

---
[0] https://github.com/marioroy/cachymod
    the prefer-idle-core patch is 0280-prefer-prevcpu-for-wakeup.patch
    this is more about mindfulness of limited CPU saturation than about
    getting the patch accepted
    prefcore + prefer-idle-core is the surreal combination, improving many workloads

[1] https://github.com/marioroy/mce-sandbox
    ./algorithm3.pl 1e12 --threads=N
    ./algorithm3.pl 1e12 --threads=N
    algorithm3.pl is akin to a server/client application; chatty
    primesieve.pl is more CPU-bound; less chatty
    optionally, compare with the primesieve binary (fully CPU-bound, not chatty)
    https://github.com/kimwalisch/primesieve

[2] https://math.dartmouth.edu/~sarunas/darktable_bench.html
    OMP_NUM_THREADS=N darktable-cli setubal.orf setubal.orf.xmp test.jpg \
    --core --disable-opencl -d perf
    result: pixel pipeline processing took {...} secs

[3] https://openbenchmarking.org/test/pts/easywave
    OMP_NUM_THREADS=N ./src/easywave \
    -grid examples/e2Asean.grd -source examples/BengkuluSept2007.flt \
    -time 600
    result: Model time = 10:00:00,   elapsed: {...} msec

[4] https://openbenchmarking.org/test/pts/stress-ng
    stress-ng -t 30 --metrics-brief --sock N --no-rand-seed --sock-zerocopy
    result: bogo ops  real time  usr time  sys time   bogo ops/s     bogo ops/s
                        (secs)    (secs)    (secs)   (real time) (usr+sys time)
                                                        {...}
    this runs 2x N threads due to the { writer, reader } threads per sock
    hence the added 12cpus result (12 x 2 = 24 <= 50% saturation)

[5] https://github.com/sched-ext/scx
    cargo build --release -p scx_cosmos

On 1/27/26 10:17 AM, Peter Zijlstra wrote:
On Tue, Jan 27, 2026 at 11:40:41AM +0100, Peter Zijlstra wrote:
On Fri, Jan 23, 2026 at 12:03:06PM +0100, Peter Zijlstra wrote:
On Fri, Jan 23, 2026 at 11:50:46AM +0100, Peter Zijlstra wrote:
On Sun, Jan 18, 2026 at 03:46:22PM -0500, Mario Roy wrote:
The patch "Proportional newidle balance" introduced a regression
with Linux 6.12.65 and 6.18.5. There is noticeable regression with
easyWave testing. [1]

The CPU is AMD Threadripper 9960X CPU (24/48). I followed the source
to install easyWave [2]. That is fetching the two tar.gz archives.
What is the actual configuration of that chip? Is it like 3*8 or 4*6
(CCX wise). A quick google couldn't find me the answer :/
Obviously I found it right after sending this. It's a 4x6 config.
Meaning it needs newidle to balance between those 4 domains.
So with the below patch on top of my Xeon w7-2495X (which is 24-core
48-thread) I too have 4 LLC :-)

And I think I can see a slight difference, but nowhere near as terrible.

Let me go stick some tracing on.
Does this help some?

Turns out, this easywave thing has a very low newidle rate, but then
also a fairly low success rate. But since it doesn't do it that often,
the cost isn't that significant so we might as well always do it etc..

This adds a second term to the ratio computation that takes time into
account. For low-rate newidle this term will dominate, while for higher
rates the success ratio is more important.

Chris, afaict this still DTRT for schbench, but if this works for Mario,
could you also re-run things at your end?

[ the 4 'second' thing is a bit random, but looking at the timings
between easywave and schbench this seems to be a reasonable middle
ground. Although I think 8 'seconds' -- 23 shift -- would also work.

That would give:

1024 -   8 s -   64 Hz
 512 -   4 s -  128 Hz
 256 -   2 s -  256 Hz
 128 -   1 s -  512 Hz
  64 -  .5 s - 1024 Hz
  32 - .25 s - 2048 Hz
]
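
[ A quick userspace sanity check of the shift arithmetic, assuming
sched_clock() returns nanoseconds: one ratio point is 2^22 ns (~4.19 ms),
so the full 1024 corresponds to 2^32 ns (~4.29 s) between window resets;
the round "4 s / 128 Hz" figures in the patch comment are approximations.

```python
# NI_RATE term: ratio points earned purely from elapsed time since the
# last 1024-call window reset (delta in nanoseconds, as sched_clock()).
def rate_term(delta_ns: int, shift: int = 22) -> int:
    return min(1024, delta_ns >> shift)

# Each window covers 512 new calls (counters are halved on reset), so
# the approximate newidle frequency for a given window length is
# 512 / delta. Reproduce the comment's table from the exact values:
for ratio in (1024, 512, 256, 128, 64):
    delta_s = (ratio << 22) / 1e9          # window length in seconds
    print(f"{ratio:5d} - {delta_s:5.2f} s - {512 / delta_s:7.1f} Hz")
```
]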

---

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 45c0022b91ce..a1e1032426dc 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -95,6 +95,7 @@ struct sched_domain {
 	unsigned int newidle_call;
 	unsigned int newidle_success;
 	unsigned int newidle_ratio;
+	u64 newidle_stamp;
 
 	u64 max_newidle_lb_cost;
 	unsigned long last_decay_max_lb_cost;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eca642295c4b..ab9cf06c6a76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12224,8 +12224,31 @@ static inline void update_newidle_stats(struct sched_domain *sd, unsigned int su
 	sd->newidle_call++;
 	sd->newidle_success += success;
 	if (sd->newidle_call >= 1024) {
-		sd->newidle_ratio = sd->newidle_success;
+		u64 now = sched_clock();
+		s64 delta = now - sd->newidle_stamp;
+		sd->newidle_stamp = now;
+		int ratio = 0;
+
+		if (delta < 0)
+			delta = 0;
+
+		if (sched_feat(NI_RATE)) {
+			/*
+			 * ratio   delta     freq
+			 *
+			 * 1024 -   4 s -  128 Hz
+			 *  512 -   2 s -  256 Hz
+			 *  256 -   1 s -  512 Hz
+			 *  128 -  .5 s - 1024 Hz
+			 *   64 - .25 s - 2048 Hz
+			 */
+			ratio = delta >> 22;
+		}
+
+		ratio += sd->newidle_success;
+
+		sd->newidle_ratio = min(1024, ratio);
 		sd->newidle_call /= 2;
 		sd->newidle_success /= 2;
 	}
@@ -12932,7 +12959,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 			unsigned int weight = 1;
 
-			if (sched_feat(NI_RANDOM)) {
+			if (sched_feat(NI_RANDOM) && sd->newidle_ratio < 1024) {
 				/*
 				 * Throw a 1k sided dice; and only run
 				 * newidle_balance according to the success
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 980d92bab8ab..7aba7523c6c1 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -126,3 +126,4 @@ SCHED_FEAT(LATENCY_WARN, false)
  * Do newidle balancing proportional to its success rate using randomization.
  */
 SCHED_FEAT(NI_RANDOM, true)
+SCHED_FEAT(NI_RATE, true)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..05741f18f334 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -4,6 +4,7 @@
  */
 #include <linux/sched/isolation.h>
+#include <linux/sched/clock.h>
 #include <linux/bsearch.h>
 
 #include "sched.h"
@@ -1637,6 +1638,7 @@ sd_init(struct sched_domain_topology_level *tl,
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
 	int sd_id, sd_weight, sd_flags = 0;
 	struct cpumask *sd_span;
+	u64 now = sched_clock();
 
 	sd_weight = cpumask_weight(tl->mask(tl, cpu));
@@ -1674,6 +1676,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.newidle_call = 512,
 		.newidle_success = 256,
 		.newidle_ratio = 512,
+		.newidle_stamp = now,
 
 		.max_newidle_lb_cost = 0,
 		.last_decay_max_lb_cost = jiffies,
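
For anyone who wants to poke at the window logic outside the kernel,
here is a rough Python model of update_newidle_stats() plus the
NI_RANDOM gate. The dice comparison itself is not visible in the quoted
hunk, so the `randrange(1024) < ratio` check is an assumption.

```python
import random

class SDStats:
    """Rough userspace model of the per-domain newidle stats (not kernel code)."""

    def __init__(self, now_ns: int):
        # sd_init() seeds the counters at half scale.
        self.call, self.success, self.ratio = 512, 256, 512
        self.stamp = now_ns

    def update(self, success: int, now_ns: int) -> None:
        self.call += 1
        self.success += success
        if self.call >= 1024:
            delta = max(0, now_ns - self.stamp)
            self.stamp = now_ns
            # NI_RATE: a long quiet window (large delta) keeps the ratio
            # high even when few balance attempts succeed.
            self.ratio = min(1024, (delta >> 22) + self.success)
            self.call //= 2
            self.success //= 2

    def should_balance(self) -> bool:
        # A full ratio skips the dice entirely, matching the
        # `sd->newidle_ratio < 1024` short-circuit in the patch.
        return self.ratio >= 1024 or random.randrange(1024) < self.ratio
```

Feeding it 512 successful calls with a ~4.3 s window drives the ratio to
the full 1024, after which should_balance() always returns True.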