RE: [PATCH] SCHED: scatter nohz idle balance target cpus
From: Jianyong Wu
Date: Wed Mar 19 2025 - 05:51:42 EST
> -----Original Message-----
> From: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Sent: Wednesday, March 19, 2025 5:26 PM
> To: Jianyong Wu <wujianyong@xxxxxxxx>
> Cc: mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; jianyong.wu@xxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
>
> On Wed, 19 Mar 2025 at 10:03, Jianyong Wu <wujianyong@xxxxxxxx> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > > Sent: Wednesday, March 19, 2025 4:46 PM
> > > To: Jianyong Wu <wujianyong@xxxxxxxx>
> > > Cc: mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; jianyong.wu@xxxxxxxxxxx;
> > > linux-kernel@xxxxxxxxxxxxxxx
> > > Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
> > >
> > > On Tue, 18 Mar 2025 at 03:27, Jianyong Wu <wujianyong@xxxxxxxx> wrote:
> > > >
> > > > Currently, the cpu selection logic for nohz idle balance keeps no
> > > > history, so cpu0 is always chosen whenever it is in the nohz cpu mask.
> > > > This is not fair for the tasks residing on numa node0, and it gets
> > > > worse on machines with a large number of cpus, where nohz idle
> > > > balance may become very heavy.
> > >
> > > Could you provide more details about why it's not fair for tasks
> > > that reside on numa node 0? cpu0 is idle, so ilb doesn't steal time
> > > from other tasks.
> > >
> > > Do you have figures or use cases to highlight this unfairness?
> > >
> > [Jianyong Wu]
> > Yeah, here is a test case.
> > In a system with a large number of CPUs (in my scenario, there are 256
> > CPUs), when the entire system is under low load, if you bind two or more
> > CPU-bound jobs to a single CPU other than CPU0, you'll notice that the
> > softirq utilization of CPU0 can reach approximately 10%, while it remains
> > negligible on the other CPUs. Checking /proc/softirqs makes it evident
> > that a significant number of SCHED softirqs are executed only on CPU0.
>
> yes, but this 10% of softirq time would have been idle time otherwise, so
> why should we care?
>
[Jianyong Wu]
However, this softirq time is proportional to the number of CPUs, and whenever tasks do get scheduled onto CPU0 they will see extra delays from it. A minimal reproduction sketch follows below.
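
For reference, here is a minimal userspace sketch of the test case above. The CPU number, job count and busy loop are illustrative assumptions, not part of the patch; it simply pins a couple of CPU-bound processes to one non-zero CPU so that the rest of the machine stays mostly idle, which is the condition under which CPU0's SCHED softirq count in /proc/softirqs keeps climbing.

/*
 * Reproduction sketch: pin NR_JOBS CPU-bound processes to TARGET_CPU
 * (any CPU other than 0) on an otherwise idle machine, then watch the
 * SCHED line of /proc/softirqs and the per-CPU si% in mpstat/top.
 * TARGET_CPU and NR_JOBS are arbitrary example values.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define TARGET_CPU 1
#define NR_JOBS    2

static void busy_loop_on_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set)) {
                perror("sched_setaffinity");
                exit(1);
        }
        for (;;)
                ;               /* fully CPU-bound, never sleeps */
}

int main(void)
{
        int i;

        for (i = 0; i < NR_JOBS; i++)
                if (fork() == 0)
                        busy_loop_on_cpu(TARGET_CPU);

        pause();                /* parent just waits; kill the group to stop */
        return 0;
}

With this running for a few minutes, the SCHED counter in /proc/softirqs grows almost exclusively in the CPU0 column, as described above.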
> On the opposite side, if all your cpus are busy, then cpu0 will do the busy
> load balance of the parent sched domain levels whereas other cpus will not,
> and this time is stolen from the task running on CPU0, which could be unfair
> compared to the tasks running on other CPUs
>
[Jianyong Wu]
Yeah. Do you think this patch can mitigate this?
>
> > > >
> > > > To address this issue, add a member to "nohz" that records which cpu
> > > > was chosen last time, and start the search from the next cpu for each
> > > > round of nohz idle balance.
> > > >
> > > > Signed-off-by: Jianyong Wu <wujianyong@xxxxxxxx>
> > > > ---
> > > > kernel/sched/fair.c | 9 ++++++---
> > > > 1 file changed, 6 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index c798d2795243..ba6930c79e25 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7197,6 +7197,7 @@ static struct {
> > > >          atomic_t nr_cpus;
> > > >          int has_blocked;                /* Idle CPUS has blocked load */
> > > >          int needs_update;               /* Newly idle CPUs need their next_balance collated */
> > > > +        int last_cpu;                   /* Last cpu chosen to do nohz idle balance */
> > > >          unsigned long next_balance;     /* in jiffy units */
> > > >          unsigned long next_blocked;     /* Next update of blocked load in jiffies */
> > > >  } nohz ____cacheline_aligned;
> > > > @@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
> > > >
> > > >          hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> > > >
> > > > -        for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
> > > > +        for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.last_cpu + 1) {
> > > >
> > > > -                if (ilb_cpu == smp_processor_id())
> > > > +                if (ilb_cpu == smp_processor_id() || !cpumask_test_cpu(ilb_cpu, hk_mask))
> > > >                          continue;
> > > >
> > > > -                if (idle_cpu(ilb_cpu))
> > > > +                if (idle_cpu(ilb_cpu)) {
> > > > +                        nohz.last_cpu = ilb_cpu;
> > > >                          return ilb_cpu;
> > > > +                }
> > > >          }
> > > >
> > > >          return -1;
> > > > --
> > > > 2.43.0
> > > >
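
[Jianyong Wu]
To make the intended behavior easier to see, below is a rough userspace illustration of what the change to find_new_ilb() does: remember the cpu chosen last time and start the next scan just after it, wrapping around the mask, so successive nohz idle balances are scattered across the idle CPUs instead of always landing on the lowest-numbered one. NR_CPUS, the idle[] contents and pick_ilb_cpu() are made up for this example; the real kernel loop additionally skips the current CPU and non-housekeeping CPUs.

#include <stdio.h>

#define NR_CPUS 8

static int idle[NR_CPUS] = { 1, 1, 0, 1, 0, 1, 1, 1 };  /* 1 = idle */
static int last_cpu = -1;

/* Scan from last_cpu + 1, wrapping around, and record the choice. */
static int pick_ilb_cpu(void)
{
        int i;

        for (i = 0; i < NR_CPUS; i++) {
                int cpu = (last_cpu + 1 + i) % NR_CPUS;

                if (idle[cpu]) {
                        last_cpu = cpu;
                        return cpu;
                }
        }
        return -1;
}

int main(void)
{
        int round;

        /* Without the last_cpu history, every round would print cpu 0. */
        for (round = 0; round < 4; round++)
                printf("round %d -> cpu %d\n", round, pick_ilb_cpu());
        return 0;
}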