Re: [PATCH 4/4] sched/fair: Proportional newidle balance

From: Mario Roy

Date: Thu Jan 29 2026 - 20:45:03 EST


Peter, thank you for your fix to improve EEVDF.

Cc'd Andrea Righi
Thank you for the is_idle_core() function and help. [0]

Cc'd Shubhang Kaushik
Your patch inspired me to do some trial-and-error testing, which has
since become the 0280 patch in the CachyMod GitHub repo. [0]

Together with the help of CachyOS community members, we concluded
that prefcore + prefer-idle-core is surreal. I enjoy the EEVDF
scheduler a lot more now that it favors the SMT siblings less.

For comparison, I added results for sched-ext cosmos.

Limited CPU saturation can reveal potential scheduler issues.
Testing includes 100%, 50%, 31.25%, and 25% CPU saturation.
All kernels were built with GCC to factor out Clang/AutoFDO.
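For reference, the saturation percentages are simply the thread count
over the machine's 48 hardware threads (the 24-core/48-thread
Threadripper 9960X mentioned below); a trivial sketch:

```python
# Saturation = N threads / 48 hardware threads (24c/48t Threadripper 9960X).
def saturation(nthreads: int, hw_threads: int = 48) -> float:
    return 100.0 * nthreads / hw_threads

for n in (48, 24, 15, 12):
    print(f"{n}cpus -> {saturation(n)}%")   # 100.0, 50.0, 31.25, 25.0
```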

A) 6.18.8-rc1
   with sched/fair: Proportional newidle balance

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.462s      14.181s        20.311s        24.498s
   darktable  [2]       2.811s       3.715s         5.315s         6.434s
   easywave   [3]      19.747s      10.804s        20.207s        21.571s
   stress-ng  [4]     37632.06     56220.21       41694.50       34740.58

B) 6.18.8-rc1
   Peter Z's fix for sched/fair: Proportional newidle balance

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.340s      14.733s        21.339s        25.069s
   darktable  [2]       2.493s       3.616s         5.148s         5.968s
   easywave   [3]      11.357s      13.312s *      18.483s        20.741s
   stress-ng  [4]     37533.24     55419.85       39452.17       32217.55

   algorithm3 and stress-ng regressed, possibly a limited-CPU-saturation anomaly
   easywave (*) weird result; repeatable, but all over the place

C) 6.18.8-rc1
   Revert sched/fair: Proportional newidle balance

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.286s      15.101s        21.417s        25.126s
   darktable  [2]       2.484s       3.531s         5.185s         6.002s
   easywave   [3]      11.517s      12.300s        18.466s        20.428s
   stress-ng  [4]     42231.92     47306.18 *     32438.03 *     28820.83 *

   stress-ng (*) lackluster with limited CPU saturation

D) 6.18.8-rc1
   Revert sched/fair: Proportional newidle balance
   Plus apply the prefer-idle-core patch [0]

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.312s      11.292s        17.243s        21.811s
   darktable  [2]       2.418s       3.711s *       5.499s *       6.510s *
   easywave   [3]      10.035s       9.832s        15.738s        18.805s
   stress-ng  [4]     44837.41     63364.56       55646.26       48202.58

   darktable (*) lower performance with limited CPU saturation
   noticeably better performance otherwise

E) scx_cosmos -m 0-5 -s 800 -l 8000 -f -c 1 -p 0 [5]

                    48cpus(100%)  24cpus(50%)  15cpus(31.25%)  12cpus(25%)
   algorithm3 [1]       9.218s      11.188s        17.045s        21.130s
   darktable  [2]       2.365s       3.900s         4.626s         5.664s
   easywave   [3]       9.187s      16.528s *      15.933s        16.991s
   stress-ng  [4]     21065.70     36417.65       27185.95       23141.87

   easywave (*) sched-ext cosmos appears to favor SMT siblings

---
[0] https://github.com/marioroy/cachymod
    the prefer-idle-core patch is 0280-prefer-prevcpu-for-wakeup.patch
    this is more about mindfulness of limited CPU saturation than about
    getting the patch accepted
    prefcore + prefer-idle-core is the surreal combination, improving many workloads

[1] https://github.com/marioroy/mce-sandbox
    ./algorithm3.pl 1e12 --threads=N
    ./algorithm3.pl 1e12 --threads=N
    algorithm3.pl is akin to a server/client application; chatty
    primesieve.pl is more CPU-bound; less chatty
    optionally, compare with the primesieve binary (fully CPU-bound, not chatty)
    https://github.com/kimwalisch/primesieve

[2] https://math.dartmouth.edu/~sarunas/darktable_bench.html
    OMP_NUM_THREADS=N darktable-cli setubal.orf setubal.orf.xmp test.jpg \
    --core --disable-opencl -d perf
    result: pixel pipeline processing took {...} secs

[3] https://openbenchmarking.org/test/pts/easywave
    OMP_NUM_THREADS=N ./src/easywave \
    -grid examples/e2Asean.grd -source examples/BengkuluSept2007.flt \
    -time 600
    result: Model time = 10:00:00,   elapsed: {...} msec

[4] https://openbenchmarking.org/test/pts/stress-ng
    stress-ng -t 30 --metrics-brief --sock N --no-rand-seed --sock-zerocopy
    result: bogo ops  real time  usr time  sys time   bogo ops/s     bogo ops/s
                        (secs)    (secs)    (secs)   (real time) (usr+sys time)
                                                        {...}
    this runs 2x N threads due to the { writer, reader } threads per sock
    hence the added 12cpus result (12 x 2 = 24 <= 50% saturation)

[5] https://github.com/sched-ext/scx
    cargo build --release -p scx_cosmos

On 1/27/26 10:17 AM, Peter Zijlstra wrote:
On Tue, Jan 27, 2026 at 11:40:41AM +0100, Peter Zijlstra wrote:
On Fri, Jan 23, 2026 at 12:03:06PM +0100, Peter Zijlstra wrote:
On Fri, Jan 23, 2026 at 11:50:46AM +0100, Peter Zijlstra wrote:
On Sun, Jan 18, 2026 at 03:46:22PM -0500, Mario Roy wrote:
The patch "Proportional newidle balance" introduced a regression
with Linux 6.12.65 and 6.18.5. There is noticeable regression with
easyWave testing. [1]

The CPU is AMD Threadripper 9960X CPU (24/48). I followed the source
to install easyWave [2]. That is fetching the two tar.gz archives.
What is the actual configuration of that chip? Is it like 3*8 or 4*6
(CCX wise). A quick google couldn't find me the answer :/
Obviously I found it right after sending this. It's a 4x6 config.
Meaning it needs newidle to balance between those 4 domains.
So with the below patch on top of my Xeon w7-2495X (which is 24-core
48-thread) I too have 4 LLC :-)

And I think I can see a slight difference, but nowhere near as terrible.

Let me go stick some tracing on.
Does this help some?

Turns out, this easywave thing has a very low newidle rate, but then
also a fairly low success rate. But since it doesn't do it that often,
the cost isn't that significant so we might as well always do it etc..

This adds a second term to the ratio computation that takes time into
account. For low-rate newidle this term will dominate, while for higher
rates the success ratio is more important.

Chris, afaict this still DTRT for schbench, but if this works for Mario,
could you also re-run things at your end?

[ the 4 'second' thing is a bit random, but looking at the timings
between easywave and schbench this seems to be a reasonable middle
ground. Although I think 8 'seconds' -- 23 shift -- would also work.

That would give:

1024 -   8 s -   64 Hz
 512 -   4 s -  128 Hz
 256 -   2 s -  256 Hz
 128 -   1 s -  512 Hz
  64 -  .5 s - 1024 Hz
  32 - .25 s - 2048 Hz
]
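
[ A quick userspace sanity check of the shift arithmetic, assuming
sched_clock() returns nanoseconds: one ratio point is 2^22 ns (~4.19 ms),
so the full 1024 corresponds to 2^32 ns (~4.29 s) between window resets;
the round "4 s / 128 Hz" figures in the patch comment are approximations.

```python
# NI_RATE term: ratio points earned purely from elapsed time since the
# last 1024-call window reset (delta in nanoseconds, as sched_clock()).
def rate_term(delta_ns: int, shift: int = 22) -> int:
    return min(1024, delta_ns >> shift)

# Each window covers 512 new calls (counters are halved on reset), so
# the approximate newidle frequency for a given window length is
# 512 / delta. Reproduce the comment's table from the exact values:
for ratio in (1024, 512, 256, 128, 64):
    delta_s = (ratio << 22) / 1e9          # window length in seconds
    print(f"{ratio:5d} - {delta_s:5.2f} s - {512 / delta_s:7.1f} Hz")
```
]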

---

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 45c0022b91ce..a1e1032426dc 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -95,6 +95,7 @@ struct sched_domain {
 	unsigned int newidle_call;
 	unsigned int newidle_success;
 	unsigned int newidle_ratio;
+	u64 newidle_stamp;
 
 	u64 max_newidle_lb_cost;
 	unsigned long last_decay_max_lb_cost;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eca642295c4b..ab9cf06c6a76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12224,8 +12224,31 @@ static inline void update_newidle_stats(struct sched_domain *sd, unsigned int su
 	sd->newidle_call++;
 	sd->newidle_success += success;
 	if (sd->newidle_call >= 1024) {
-		sd->newidle_ratio = sd->newidle_success;
+		u64 now = sched_clock();
+		s64 delta = now - sd->newidle_stamp;
+		sd->newidle_stamp = now;
+		int ratio = 0;
+
+		if (delta < 0)
+			delta = 0;
+
+		if (sched_feat(NI_RATE)) {
+			/*
+			 * ratio   delta     freq
+			 *
+			 * 1024 -   4 s -  128 Hz
+			 *  512 -   2 s -  256 Hz
+			 *  256 -   1 s -  512 Hz
+			 *  128 -  .5 s - 1024 Hz
+			 *   64 - .25 s - 2048 Hz
+			 */
+			ratio = delta >> 22;
+		}
+
+		ratio += sd->newidle_success;
+
+		sd->newidle_ratio = min(1024, ratio);
 		sd->newidle_call /= 2;
 		sd->newidle_success /= 2;
 	}
@@ -12932,7 +12959,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 			unsigned int weight = 1;
 
-			if (sched_feat(NI_RANDOM)) {
+			if (sched_feat(NI_RANDOM) && sd->newidle_ratio < 1024) {
 				/*
 				 * Throw a 1k sided dice; and only run
 				 * newidle_balance according to the success
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 980d92bab8ab..7aba7523c6c1 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -126,3 +126,4 @@ SCHED_FEAT(LATENCY_WARN, false)
  * Do newidle balancing proportional to its success rate using randomization.
  */
 SCHED_FEAT(NI_RANDOM, true)
+SCHED_FEAT(NI_RATE, true)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..05741f18f334 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -4,6 +4,7 @@
  */
 #include <linux/sched/isolation.h>
+#include <linux/sched/clock.h>
 #include <linux/bsearch.h>
 
 #include "sched.h"
@@ -1637,6 +1638,7 @@ sd_init(struct sched_domain_topology_level *tl,
 	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
 	int sd_id, sd_weight, sd_flags = 0;
 	struct cpumask *sd_span;
+	u64 now = sched_clock();
 
 	sd_weight = cpumask_weight(tl->mask(tl, cpu));
@@ -1674,6 +1676,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		.newidle_call = 512,
 		.newidle_success = 256,
 		.newidle_ratio = 512,
+		.newidle_stamp = now,
 
 		.max_newidle_lb_cost = 0,
 		.last_decay_max_lb_cost = jiffies,
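
For anyone who wants to poke at the window logic outside the kernel,
here is a rough Python model of update_newidle_stats() plus the
NI_RANDOM gate. The dice comparison itself is not visible in the quoted
hunk, so the `randrange(1024) < ratio` check is an assumption.

```python
import random

class SDStats:
    """Rough userspace model of the per-domain newidle stats (not kernel code)."""

    def __init__(self, now_ns: int):
        # sd_init() seeds the counters at half scale.
        self.call, self.success, self.ratio = 512, 256, 512
        self.stamp = now_ns

    def update(self, success: int, now_ns: int) -> None:
        self.call += 1
        self.success += success
        if self.call >= 1024:
            delta = max(0, now_ns - self.stamp)
            self.stamp = now_ns
            # NI_RATE: a long quiet window (large delta) keeps the ratio
            # high even when few balance attempts succeed.
            self.ratio = min(1024, (delta >> 22) + self.success)
            self.call //= 2
            self.success //= 2

    def should_balance(self) -> bool:
        # A full ratio skips the dice entirely, matching the
        # `sd->newidle_ratio < 1024` short-circuit in the patch.
        return self.ratio >= 1024 or random.randrange(1024) < self.ratio
```

Feeding it 512 successful calls with a ~4.3 s window drives the ratio to
the full 1024, after which should_balance() always returns True.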