RE: [PATCH] SCHED: scatter nohz idle balance target cpus

From: Jianyong Wu
Date: Wed Mar 19 2025 - 05:04:23 EST




> -----Original Message-----
> From: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Sent: Wednesday, March 19, 2025 4:46 PM
> To: Jianyong Wu <wujianyong@xxxxxxxx>
> Cc: mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; jianyong.wu@xxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
>
> On Tue, 18 Mar 2025 at 03:27, Jianyong Wu <wujianyong@xxxxxxxx> wrote:
> >
> > Currently, the cpu selection logic for nohz idle balance keeps no
> > history, so cpu0 is always chosen whenever it is in the nohz cpu mask.
> > This is not fair to the tasks residing on numa node0, and it gets worse
> > on machines with a large cpu count, where nohz idle balance can be
> > very heavy.
>
> Could you provide more details about why it's not fair for tasks that reside on
> numa node 0? cpu0 is idle so ilb doesn't steal time from other tasks.
>
> Do you have figures or use cases to highlight this unfairness ?
>
[Jianyong Wu]
Yeah, here is a test case.
On a system with a large number of CPUs (256 in my scenario), under low overall load, bind two or more CPU-bound jobs to a single CPU other than CPU0. You will observe that the softirq utilization of CPU0 reaches roughly 10%, while it remains negligible on every other CPU. Checking /proc/softirqs makes it evident that a significant number of SCHED softirqs are executed only on CPU0.
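For reference, the skew is visible directly in /proc/softirqs; a minimal one-liner (assuming the usual layout of one column per online CPU after the softirq name) compares CPU0's SCHED count against all other CPUs combined:

```shell
# Column 2 of the SCHED row is CPU0; sum columns 3..NF for the rest.
# On an affected machine, CPU0's count dwarfs the combined remainder.
awk '/SCHED/ { printf "CPU0: %s  all-other-cpus: ", $2;
               s = 0; for (i = 3; i <= NF; i++) s += $i; print s }' /proc/softirqs
```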
> >
> > To address this issue, adding a member to "nohz" to indicate who is
> > chosen last time and choose next for this round of nohz idle balance.
> >
> > Signed-off-by: Jianyong Wu <wujianyong@xxxxxxxx>
> > ---
> > kernel/sched/fair.c | 9 ++++++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c798d2795243..ba6930c79e25 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7197,6 +7197,7 @@ static struct {
> >         atomic_t nr_cpus;
> >         int has_blocked;                /* Idle CPUS has blocked load */
> >         int needs_update;               /* Newly idle CPUs need their next_balance collated */
> > +       int last_cpu;                   /* Last cpu chosen to do nohz idle balance */
> >         unsigned long next_balance;     /* in jiffy units */
> >         unsigned long next_blocked;     /* Next update of blocked load in jiffies */
> > } nohz ____cacheline_aligned;
> > @@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
> >
> >         hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> >
> > -       for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
> > +       for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.last_cpu + 1) {
> >
> > -               if (ilb_cpu == smp_processor_id())
> > +               if (ilb_cpu == smp_processor_id() || !cpumask_test_cpu(ilb_cpu, hk_mask))
> >                         continue;
> >
> > -               if (idle_cpu(ilb_cpu))
> > +               if (idle_cpu(ilb_cpu)) {
> > +                       nohz.last_cpu = ilb_cpu;
> >                         return ilb_cpu;
> > +               }
> >         }
> >
> >         return -1;
> > --
> > 2.43.0
> >