Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

From: Zhang, Rui
Date: Mon Sep 11 2023 - 18:11:01 EST


On Mon, 2023-09-11 at 19:42 +0800, Aaron Lu wrote:
> Hi Pierre,
>
> On Fri, Sep 08, 2023 at 11:43:50AM +0200, Pierre Gondois wrote:
> > Hello Rui,
> >
> > On 8/14/23 10:30, Zhang, Rui wrote:
> > > On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
> > > > Hi Rui,
> > > >
> > > > On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
> > > > > Problem statement
> > > > > -----------------
> > > > > When using a cgroup isolated partition to isolate cpus including cpu0,
> > > > > it is observed that cpu0 is woken up frequently but does nothing. This
> > > > > is not good for power efficiency.
> > > > >
> > > > > <idle>-0     [000]   616.491602: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > > > > <idle>-0     [000]   616.491608: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
> > > > > <idle>-0     [000]   616.491616: rcu_utilization:      Start context switch
> > > > > <idle>-0     [000]   616.491618: rcu_utilization:      End context switch
> > > > > <idle>-0     [000]   616.491637: tick_stop:            success=1 dependency=NONE
> > > > > <idle>-0     [000]   616.491637: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
> > > > > <idle>-0     [000]   616.491638: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000
> > > > >
> > > > > The above pattern repeats every one or a few ticks, resulting in a
> > > > > total of 2000+ wakeups on cpu0 in 60 seconds when running a workload
> > > > > on the cpus that are not in the isolated partition.
> > > > >
> > > > > Root cause
> > > > > ----------
> > > > > In NOHZ mode, an active cpu either sends an IPI or touches the idle
> > > > > cpu's polling flag to wake it up, so that the idle cpu can pull tasks
> > > > > from the busy cpu. The target cpu is selected as the first idle cpu
> > > > > present in both nohz.idle_cpus_mask and housekeeping_cpumask.
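
(For reference, the selection logic described above is find_new_ilb() in
kernel/sched/fair.c; roughly condensed, and modulo kernel version, it looks
like this:)

        static inline int find_new_ilb(void)
        {
                const struct cpumask *hk_mask;
                int ilb;

                hk_mask = housekeeping_cpumask(HK_TYPE_MISC);

                /* first idle cpu present in both masks wins */
                for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
                        if (idle_cpu(ilb))
                                return ilb;
                }

                return nr_cpu_ids;
        }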
> > > > >
> > > > > In the above scenario, when cpu0 is in the cgroup isolated partition,
> > > > > its sched domain is detached, but it is still present in both of the
> > > > > above cpumasks. As a result, cpu0
> > > >
> > > > I saw in nohz_balance_enter_idle() that if a cpu is isolated, it will
> > > > not set itself in nohz.idle_cpus_mask and thus should not be chosen as
> > > > ilb_cpu. I wonder what's stopping this from working?
> > >
> > > One thing I forgot to mention is that the problem is gone if we offline
> > > and re-online those cpus. In that case, the isolated cpus are removed
> > > from nohz.idle_cpus_mask in sched_cpu_deactivate() and are never added
> > > back.
> > >
> > > At runtime, the cpus can be removed from nohz.idle_cpus_mask only in the
> > > case below:
> > >         trigger_load_balance()
> > >                 if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
> > >                         return;
> > >                 nohz_balancer_kick(rq);
> > >                         nohz_balance_exit_idle()
> > >
> > > My understanding is that if a cpu is in nohz.idle_cpus_mask when it is
> > > isolated, there is no chance to remove it from that mask later, so the
> > > check in nohz_balance_enter_idle() does not help.
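
(To spell this out: on a cpu with a NULL sched domain, trigger_load_balance()
bails out before it can ever reach nohz_balance_exit_idle(). Roughly
paraphrased from kernel/sched/fair.c, modulo kernel version:)

        void trigger_load_balance(struct rq *rq)
        {
                /*
                 * Don't need to rebalance while attached to NULL domain or
                 * runqueue CPU is not active.
                 */
                if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
                        return;

                if (time_after_eq(jiffies, rq->next_balance))
                        raise_softirq(SCHED_SOFTIRQ);

                /* never reached for an isolated cpu */
                nohz_balancer_kick(rq);         /* -> nohz_balance_exit_idle() */
        }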
> >
> >
> > As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped is
> > never updated. Instead of avoiding isolated CPUs at each find_new_ilb()
> > call as this patch does, maybe it would be better to permanently remove
> > these CPUs from nohz.idle_cpus_mask. This would lower the number of
> > checks done.
>
> I agree.

Sounds reasonable to me.

>
> > This could be done with something like:
> > @@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
> >           */
> >          rq->has_blocked_load = 1;
> > +       /* If we're a completely isolated CPU, we don't play: */
> > +       if (on_null_domain(rq)) {
> > +               if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask)) {
> > +                       cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
> > +                       atomic_dec(&nohz.nr_cpus);
> > +               }
> > +
>
>
> > +               /*
> > +                * Set nohz_tick_stopped on isolated CPUs to avoid keeping the
> > +                * value that was stored when rebuild_sched_domains() was called.
> > +                */
> > +               rq->nohz_tick_stopped = 1;
>
> It's not immediately clear to me what the above comment and code mean;
> maybe that's because I know very little about sched domain related code.
> Other than this, your change looks good to me.
>

If we set rq->nohz_tick_stopped for the isolated cpu, then the next time
nohz_balance_exit_idle() is invoked, we clear rq->nohz_tick_stopped and
decrement nohz.nr_cpus a second time. This seems like a problem.

In the current code, rq->nohz_tick_stopped being set implies that the cpu is
in nohz.idle_cpus_mask, and the code above breaks this assumption.
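
To illustrate the concern, the exit path (roughly paraphrased from
kernel/sched/fair.c, details may differ by kernel version) clears the mask and
decrements nohz.nr_cpus whenever nohz_tick_stopped is set:

        void nohz_balance_exit_idle(struct rq *rq)
        {
                if (likely(!rq->nohz_tick_stopped))
                        return;

                /* assumes the cpu is currently in nohz.idle_cpus_mask */
                rq->nohz_tick_stopped = 0;
                cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
                atomic_dec(&nohz.nr_cpus);

                set_cpu_sd_state_busy(rq->cpu);
        }

With the proposed change, an isolated cpu would reach this function with
nohz_tick_stopped set but without being in the mask, hence the second
decrement.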

>
> > +       }
> > +
> >          /*
> >           * The tick is still stopped but load could have been added in the
> >           * meantime. We set the nohz.has_blocked flag to trig a check of the
> > @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
> >          if (rq->nohz_tick_stopped)
> >                  goto out;
> > -       /* If we're a completely isolated CPU, we don't play: */
> > -       if (on_null_domain(rq))
> > -               return;
> > -
> >          rq->nohz_tick_stopped = 1;
> >          cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
> >
> > Otherwise, I could reproduce the issue and the patch solves it, so:
> > Tested-by: Pierre Gondois <pierre.gondois@xxxxxxx>

Thanks for testing, really appreciated!
> >
> > Also, your patch doesn't aim to solve this, but I think there is an issue
> > when updating cpuset.cpus after an isolated partition has already been
> > created:
> >
> > // Create an isolated partition containing CPU0
> > # mkdir cgroup
> > # mount -t cgroup2 none cgroup/
> > # mkdir cgroup/Testing
> > # echo "+cpuset" > cgroup/cgroup.subtree_control
> > # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> > # echo 0 > cgroup/Testing/cpuset.cpus
> > # echo isolated > cgroup/Testing/cpuset.cpus.partition
> >
> > // CPU0's sched domain is detached:
> > # ls /sys/kernel/debug/sched/domains/cpu0/
> > # ls /sys/kernel/debug/sched/domains/cpu1/
> > domain0  domain1
> >
> > // Change the isolated partition to be CPU1
> > # echo 1 > cgroup/Testing/cpuset.cpus
> >
> > // CPU[0-1] sched domains are not updated:
> > # ls /sys/kernel/debug/sched/domains/cpu0/
> > # ls /sys/kernel/debug/sched/domains/cpu1/
> > domain0  domain1
> >
Interesting. Let me check and get back to you later on this. :)

thanks,
rui