Re: [PATCH] SCHED: scatter nohz idle balance target cpus
From: Vincent Guittot
Date: Wed Mar 19 2025 - 05:58:24 EST
On Wed, 19 Mar 2025 at 10:42, Jianyong Wu <wujianyong@xxxxxxxx> wrote:
>
>
>
> > -----Original Message-----
> > From: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > Sent: Wednesday, March 19, 2025 5:26 PM
> > To: Jianyong Wu <wujianyong@xxxxxxxx>
> > Cc: mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; jianyong.wu@xxxxxxxxxxx;
> > linux-kernel@xxxxxxxxxxxxxxx
> > Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
> >
> > On Wed, 19 Mar 2025 at 10:03, Jianyong Wu <wujianyong@xxxxxxxx> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > > > Sent: Wednesday, March 19, 2025 4:46 PM
> > > > To: Jianyong Wu <wujianyong@xxxxxxxx>
> > > > Cc: mingo@xxxxxxxxxx; peterz@xxxxxxxxxxxxx; jianyong.wu@xxxxxxxxxxx;
> > > > linux-kernel@xxxxxxxxxxxxxxx
> > > > Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
> > > >
> > > > On Tue, 18 Mar 2025 at 03:27, Jianyong Wu <wujianyong@xxxxxxxx> wrote:
> > > > >
> > > > > Currently, the cpu selection logic for nohz idle balance lacks
> > > > > history info, which leads to cpu0 always being chosen if it's in
> > > > > the nohz cpu mask. It's not fair for the tasks residing on numa
> > > > > node0, and it's worse on machines with a large cpu count, where
> > > > > nohz idle balance may be very heavy.
> > > >
> > > > Could you provide more details about why it's not fair for tasks
> > > > that reside on numa node 0 ? cpu0 is idle, so ilb doesn't steal
> > > > time from other tasks.
> > > >
> > > > Do you have figures or use cases to highlight this unfairness ?
> > > >
> > > [Jianyong Wu]
> > > Yeah, here is a test case.
> > > In a system with a large number of CPUs (256 in my scenario), when
> > > the entire system is under low load, if you bind two or more
> > > CPU-bound jobs to a single CPU other than CPU0, you'll notice that
> > > the softirq utilization of CPU0 can reach approximately 10%, while
> > > it remains negligible on the other CPUs. Checking /proc/softirqs
> > > makes it evident that a significant number of SCHED softirqs are
> > > executed only on CPU0.
> >
> > yes, but this 10% of softirq time would have been idle time otherwise,
> > so why should we care ?
> >
> [Jianyong Wu]
> However, this value is proportional to the number of CPUs. If a task is
> scheduled on CPU0 while this softirq work runs, it will be delayed.
ILB aborts when a task is enqueued on the CPU, so the 10% softirq time
does not reflect the delay of an enqueued task. It would be good to get
figures about the problem you are trying to solve.
>
> > On the opposite side, if all your cpus are busy, then cpu0 will do
> > the busy load balance of the parent sched domain levels whereas other
> > cpus will not, and this time is stolen from the task running on CPU0,
> > which could be unfair compared to tasks running on other CPUs
> >
> [Jianyong Wu]
> Yeah. Do you think this patch can mitigate this?
> >
> > > > >
> > > > > To address this issue, add a member to "nohz" recording which
> > > > > cpu was chosen last time, and start from the next one for this
> > > > > round of nohz idle balance.
> > > > >
> > > > > Signed-off-by: Jianyong Wu <wujianyong@xxxxxxxx>
> > > > > ---
> > > > > kernel/sched/fair.c | 9 ++++++---
> > > > > 1 file changed, 6 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
> > > > > c798d2795243..ba6930c79e25 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -7197,6 +7197,7 @@ static struct {
> > > > >         atomic_t nr_cpus;
> > > > >         int has_blocked;                /* Idle CPUS has blocked load */
> > > > >         int needs_update;               /* Newly idle CPUs need their next_balance collated */
> > > > > +       int last_cpu;                   /* Last cpu chosen to do nohz idle balance */
> > > > >         unsigned long next_balance;     /* in jiffy units */
> > > > >         unsigned long next_blocked;     /* Next update of blocked load in jiffies */
> > > > > } nohz ____cacheline_aligned;
> > > > > @@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
> > > > >
> > > > >         hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> > > > >
> > > > > -       for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
> > > > > +       for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.last_cpu + 1) {
> > > > >
> > > > > -               if (ilb_cpu == smp_processor_id())
> > > > > +               if (ilb_cpu == smp_processor_id() || !cpumask_test_cpu(ilb_cpu, hk_mask))
> > > > >                         continue;
> > > > >
> > > > > -               if (idle_cpu(ilb_cpu))
> > > > > +               if (idle_cpu(ilb_cpu)) {
> > > > > +                       nohz.last_cpu = ilb_cpu;
> > > > >                         return ilb_cpu;
> > > > > +               }
> > > > >         }
> > > > >
> > > > >         return -1;
> > > > > --
> > > > > 2.43.0
> > > > >