Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
From: Andrea Righi
Date: Fri Mar 27 2026 - 18:45:36 EST
On Fri, Mar 27, 2026 at 09:36:15PM +0100, Andrea Righi wrote:
> On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> > Hello Andrea,
> >
> > On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > > Hi Vincent,
> > >
> > > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> > >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@xxxxxxxxxx> wrote:
> > >>>
> > >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> > >>> prefer one on a fully-idle core when SMT is active, so the balancer can
> > >>> migrate work onto a CPU that still offers its full effective capacity.
> > >>> Fall back to any idle candidate if none qualifies.
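A minimal userspace sketch of the selection policy described above, assuming adjacent-numbered SMT siblings and a toy 8-CPU bitmask (the topology, mask width, and helper names here are illustrative, not the kernel's actual cpumask code):

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8                 /* toy topology: 4 cores x 2 SMT threads */
#define SIBLING(cpu) ((cpu) ^ 1)  /* assume adjacent-numbered siblings     */

/* Pick an ILB CPU from 'idle_mask' (bit i set => CPU i is idle),
 * preferring a CPU whose SMT sibling is also idle, falling back to
 * the first idle CPU otherwise. Returns -1 if no CPU is idle. */
static int pick_ilb_cpu(uint32_t idle_mask)
{
    int fallback = -1;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(idle_mask & (1u << cpu)))
            continue;
        if (idle_mask & (1u << SIBLING(cpu)))
            return cpu;            /* fully idle core: take it */
        if (fallback < 0)
            fallback = cpu;        /* remember first idle CPU  */
    }
    return fallback;
}
```

With CPUs 1, 4 and 5 idle, the sketch skips CPU 1 (its sibling, CPU 0, is busy) and picks CPU 4, whose sibling CPU 5 is also idle.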
> > >>
> > >> This one isn't straightforward for me. The ILB CPU will check all the
> > >> other idle CPUs first and finish with itself, so unless the next CPU
> > >> in idle_cpus_mask is a sibling, this should not make a difference.
> > >>
> > >> Did you see any perf diff ?
> > >
> > > I actually see a benefit: with the first patch applied I see a ~1.76x
> > > speedup, and if I add this on top I get a ~1.9x speedup vs baseline,
> > > which is pretty consistent across runs (definitely not within the
> > > error range).
> > >
> > > The intention of this change was to minimize SMT noise by running the
> > > ILB code on a fully-idle core when possible, but I didn't expect to
> > > see such a big difference.
> > >
> > > I'll investigate more to better understand what's happening.
> >
> > Interesting! Either this "CPU-intensive workload" hates the SMT sibling
> > turning busy (but to an extent where performance drops visibly?), or the
> > ILB keeps getting interrupted on an SMT sibling that is burdened by
> > interrupts, leading to a slower balance (or the IRQs driving the workload
> > being delayed by rq_lock disabling them).
> >
> > Would it be possible to share the total SCHED_SOFTIRQ time, load
> > balancing attempts, and utilization with and without the patch? I too
> > will go queue up some runs to see if this makes a difference.
>
> Quick update: I also tried this on a Vera machine with a firmware that
> exposes the same capacity for all the CPUs (so with SD_ASYM_CPUCAPACITY
> disabled and SMT still on of course) and I see similar performance
> benefits.
>
> Looking at SCHED_SOFTIRQ and load balancing attempts I don't see big
> differences, all within error range (results produced using a vibe-coded
> python script):
>
> - baseline (stats/sec):
>
> SCHED softirq count : 2,625
> LB attempts (total) : 69,832
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 68,482 [balanced=68,472 failed=9]
> CPU_IDLE : lb=1,408 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=67,041 imb(load=0 util=0 task=7 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 902 [balanced=900 failed=2]
> CPU_NEWLY_IDLE : lb=869 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 448 [balanced=441 failed=7]
> CPU_NEWLY_IDLE : lb=415 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=268 misfit=0) gained=0
>
> - with ilb-smt (stats/sec):
>
> SCHED softirq count : 2,671
> LB attempts (total) : 68,572
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 67,239 [balanced=67,197 failed=41]
> CPU_IDLE : lb=1,419 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=65,783 imb(load=0 util=0 task=42 misfit=0) gained=1
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 833 [balanced=833 failed=0]
> CPU_NEWLY_IDLE : lb=796 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 500 [balanced=488 failed=12]
> CPU_NEWLY_IDLE : lb=463 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=627 misfit=0) gained=0
>
> I'll add more direct instrumentation to check what ILB is doing
> differently...
More data.
== SMT contention ==
tracepoint:sched:sched_switch
{
    // Track whether each CPU is currently running a non-idle task
    if (args->next_pid != 0) {
        @busy[cpu] = 1;
    } else {
        delete(@busy[cpu]);
    }
}

tracepoint:sched:sched_switch
/ args->prev_pid == 0 && args->next_pid != 0 /
{
    // CPU goes idle->busy: check whether its SMT sibling is already busy
    // (352 logical CPUs, sibling threads numbered 176 apart)
    $sib = (cpu + 176) % 352;
    if (@busy[$sib]) {
        @smt_contention++;
    } else {
        @smt_no_contention++;
    }
}

END
{
    printf("smt_contention %lld\n", (int64)@smt_contention);
    printf("smt_no_contention %lld\n", (int64)@smt_no_contention);
}
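(The sibling mapping above assumes 352 logical CPUs with the second thread of each core numbered 176 apart; note the mapping is its own inverse, which a quick C check confirms:)

```c
#include <assert.h>

/* SMT sibling mapping used in the script above: 352 logical CPUs,
 * second hyperthread of core N numbered N + 176. */
static int sibling(int cpu)
{
    return (cpu + 176) % 352;
}
```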
- baseline:
@smt_contention: 1103
@smt_no_contention: 3815
- ilb-smt:
@smt_contention: 937
@smt_no_contention: 4459
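As a quick sanity check on those counters, the fraction of idle->busy transitions that land on a core whose sibling is already busy drops from roughly 22% (baseline) to roughly 17% (ilb-smt); the arithmetic is just:

```c
#include <assert.h>

/* Percentage of wakeups that hit a core whose SMT sibling was already
 * busy, computed from the two counters reported above. */
static double contention_pct(long contended, long uncontended)
{
    return 100.0 * contended / (double)(contended + uncontended);
}
```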
== ILB duration ==
- baseline:
@ilb_duration_us:
[0] 147 | |
[1] 354 |@ |
[2, 4) 739 |@@@ |
[4, 8) 3040 |@@@@@@@@@@@@@@@@ |
[8, 16) 9825 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 8142 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 1267 |@@@@@@ |
[64, 128) 1607 |@@@@@@@@ |
[128, 256) 2222 |@@@@@@@@@@@ |
[256, 512) 2326 |@@@@@@@@@@@@ |
[512, 1K) 141 | |
[1K, 2K) 37 | |
[2K, 4K) 7 | |
- ilb-smt:
@ilb_duration_us:
[0] 79 | |
[1] 137 | |
[2, 4) 1440 |@@@@@@@@@@ |
[4, 8) 2897 |@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 7433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 4993 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 2390 |@@@@@@@@@@@@@@@@ |
[64, 128) 2254 |@@@@@@@@@@@@@@@ |
[128, 256) 2731 |@@@@@@@@@@@@@@@@@@@ |
[256, 512) 1083 |@@@@@@@ |
[512, 1K) 265 |@ |
[1K, 2K) 29 | |
[2K, 4K) 5 | |
== rq_lock hold ==
- baseline:
@lb_rqlock_hold_us:
[0] 664396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 77446 |@@@@@@ |
[2, 4) 25044 |@ |
[4, 8) 19847 |@ |
[8, 16) 2434 | |
[16, 32) 605 | |
[32, 64) 308 | |
[64, 128) 38 | |
[128, 256) 2 | |
- ilb-smt:
@lb_rqlock_hold_us:
[0] 229152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 135060 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[2, 4) 26989 |@@@@@@ |
[4, 8) 48034 |@@@@@@@@@@ |
[8, 16) 1919 | |
[16, 32) 2236 | |
[32, 64) 595 | |
[64, 128) 135 | |
[128, 256) 27 | |
From what I can see, the ILB runs are more expensive, but I still don't see why
I'm getting the speedup with this ilb-smt patch. I'll keep investigating...
-Andrea