Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer

From: Andrea Righi

Date: Mon Mar 30 2026 - 13:32:11 EST


On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> Hello Andrea,
>
> On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > Hi Vincent,
> >
> > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@xxxxxxxxxx> wrote:
> >>>
> >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> >>> work onto a CPU that still offers full effective capacity. Fall back to
> >>> any idle candidate if none qualify.
> >>
> >> This one isn't straightforward for me. The ilb cpu will check all
> >> other idle CPUs 1st and finish with itself so unless the next CPU in
> >> the idle_cpus_mask is a sibling, this should not make a difference
> >>
> >> Did you see any perf diff ?
> >
> > I actually see a benefit, in particular, with the first patch applied I see
> > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > which seems pretty consistent across runs (definitely not in error range).
> >
> > The intention with this change was to minimize SMT noise running the ILB
> > code on a fully-idle core when possible, but I also didn't expect to see
> > such big difference.
> >
> > I'll investigate more to better understand what's happening.
>
> Interesting! Either this "CPU-intensive workload" hates SMT turning
> busy (but to an extent where performance drops visibly?) or ILB
> keeps getting interrupted on an SMT sibling that is burdened by
> interrupts leading to slower balance (or IRQs driving the workload
> being delayed by rq_lock disabling them)

Alright, I dug a bit deeper into what's going on.

In this case, the workload showing the large benefit (the NVBLAS benchmark)
is running exactly one task per SMT core, all pinned to NUMA node 0. The
system has two nodes, so node 1 remains mostly idle.

With the SMT-aware select_idle_capacity(), tasks get distributed across SMT
cores in a way that avoids placing them on busy siblings, which is nice,
and that's the part that gives most of the speedup.

However, without this ILB patch, find_new_ilb() always picks a CPU with a
busy sibling on node 0, because for_each_cpu_and() always starts from the
lowest CPU IDs. As a result, the ILB always ends up running on a CPU whose
SMT sibling is running a CPU-intensive worker, and the two disrupt each
other's performance.

As an experiment, I tried something silly like the following, biasing the
ILB selection toward node 1 (node0 = 0-87,176-263, node1 = 88-175,264-351):

	struct cpumask tmp;

	cpumask_and(&tmp, nohz.idle_cpus_mask, hk_mask);
	for_each_cpu_wrap(ilb_cpu, &tmp, nr_cpu_ids / 4) {
		if (ilb_cpu == smp_processor_id())
			continue;

		if (idle_cpu(ilb_cpu))
			return ilb_cpu;
	}

And I get pretty much the same speedup (slightly better actually, because I
always find an idle CPU in one step, since node 1 is always idle with this
particular benchmark).

So, in this particular scenario the patch makes sense, because we avoid
the "SMT contention" at very low cost. In general, though, I think the
benefit can be quite situational. It could still make sense to have it: the
extra overhead is limited to an additional is_core_idle() check over the
idle & HK candidates (worst case), which could be worthwhile if it reduces
interference from busy SMT siblings.

What do you think?

Thanks,
-Andrea