Re: [PATCH] SCHED: scatter nohz idle balance target cpus

From: Vincent Guittot
Date: Wed Mar 19 2025 - 05:26:47 EST


On Wed, 19 Mar 2025 at 10:03, Jianyong Wu <wujianyong@xxxxxxxx> wrote:
>
>
>
> > -----Original Message-----
> > From: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > Sent: Wednesday, March 19, 2025 4:46 PM
> > To: Jianyong Wu <wujianyong@xxxxxxxx>
> > Cc: mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; jianyong.wu@xxxxxxxxxxx;
> > linux-kernel@xxxxxxxxxxxxxxx
> > Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
> >
> > On Tue, 18 Mar 2025 at 03:27, Jianyong Wu <wujianyong@xxxxxxxx> wrote:
> > >
> > > Currently, the cpu selection logic for nohz idle balance lacks
> > > history info, so cpu0 is always chosen whenever it is in the nohz
> > > cpu mask. This is not fair to the tasks that reside on numa node 0.
> > > It is worse on machines with a large number of cpus, where nohz
> > > idle balance may become very heavy.
> >
> > Could you provide more details about why it's not fair for tasks that reside on
> > numa node 0? cpu0 is idle so ilb doesn't steal time from other tasks.
> >
> > Do you have figures or use cases to highlight this unfairness?
> >
> [Jianyong Wu]
> Yeah, here is a test case.
> In a system with a large number of CPUs (in my scenario, there are 256 CPUs), when the entire system is under a low load, if you try to bind two or more CPU - bound jobs to a single CPU other than CPU0, you'll notice that the softirq utilization for CPU0 can reach approximately 10%, while it remains negligible for other CPUs. By checking the /proc/softirqs file, it becomes evident that a significant number of SCHED softirqs are only executed on CPU0.

yes, but this 10% of softirq time would have been idle time otherwise,
so why should we care?

Conversely, if all your cpus are busy, then cpu0 will do the busy load
balance of the parent sched domain levels whereas other cpus will not,
and that time is stolen from the task running on CPU0, which could be
unfair compared to tasks running on other CPUs.


> > >
> > > To address this issue, add a member to "nohz" that records which
> > > cpu was chosen last time, and start from the next one for this
> > > round of nohz idle balance.
> > >
> > > Signed-off-by: Jianyong Wu <wujianyong@xxxxxxxx>
> > > ---
> > > kernel/sched/fair.c | 9 ++++++---
> > > 1 file changed, 6 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c798d2795243..ba6930c79e25 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7197,6 +7197,7 @@ static struct {
> > >  	atomic_t		nr_cpus;
> > >  	int			has_blocked;	/* Idle CPUS has blocked load */
> > >  	int			needs_update;	/* Newly idle CPUs need their next_balance collated */
> > > +	int			last_cpu;	/* Last cpu chosen to do nohz idle balance */
> > >  	unsigned long		next_balance;	/* in jiffy units */
> > >  	unsigned long		next_blocked;	/* Next update of blocked load in jiffies */
> > >  } nohz ____cacheline_aligned;
> > > @@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
> > >
> > >  	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> > >
> > > -	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
> > > +	for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.last_cpu + 1) {
> > >
> > > -		if (ilb_cpu == smp_processor_id())
> > > +		if (ilb_cpu == smp_processor_id() ||
> > > +		    !cpumask_test_cpu(ilb_cpu, hk_mask))
> > >  			continue;
> > >
> > > -		if (idle_cpu(ilb_cpu))
> > > +		if (idle_cpu(ilb_cpu)) {
> > > +			nohz.last_cpu = ilb_cpu;
> > >  			return ilb_cpu;
> > > +		}
> > >  	}
> > >
> > >  	return -1;
> > > --
> > > 2.43.0
> > >