Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

From: Aaron Lu
Date: Mon Sep 11 2023 - 18:06:48 EST


Hi Pierre,

On Fri, Sep 08, 2023 at 11:43:50AM +0200, Pierre Gondois wrote:
> Hello Rui,
>
> On 8/14/23 10:30, Zhang, Rui wrote:
> > On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
> > > Hi Rui,
> > >
> > > On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
> > > > Problem statement
> > > > -----------------
> > > > When using cgroup isolated partition to isolate cpus including
> > > > cpu0, it
> > > > is observed that cpu0 is woken up frequenctly but doing nothing.
> > > > This is
> > > > not good for power efficiency.
> > > >
> > > > <idle>-0     [000]   616.491602: hrtimer_cancel:
> > > > hrtimer=0xffff8e8fdf623c10
> > > > <idle>-0     [000]   616.491608: hrtimer_start:
> > > > hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
> > > > expires=615996000000 softexpires=615996000000
> > > > <idle>-0     [000]   616.491616: rcu_utilization:      Start
> > > > context switch
> > > > <idle>-0     [000]   616.491618: rcu_utilization:      End context
> > > > switch
> > > > <idle>-0     [000]   616.491637: tick_stop:            success=1
> > > > dependency=NONE
> > > > <idle>-0     [000]   616.491637: hrtimer_cancel:
> > > > hrtimer=0xffff8e8fdf623c10
> > > > <idle>-0     [000]   616.491638: hrtimer_start:
> > > > hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0
> > > > expires=616420000000 softexpires=616420000000
> > > >
> > > > The above pattern repeats every one or multiple ticks, results in
> > > > total
> > > > 2000+ wakeups on cpu0 in 60 seconds, when running workload on the
> > > > cpus that are not in the isolated partition.
> > > >
> > > > Rootcause
> > > > ---------
> > > > In NOHZ mode, an active cpu either sends an IPI or touches the idle
> > > > cpu's polling flag to wake it up, so that the idle cpu can pull
> > > > tasks
> > > > from the busy cpu. The logic for selecting the target cpu is to use
> > > > the
> > > > first idle cpu that presents in both nohz.idle_cpus_mask and
> > > > housekeeping_cpumask.
> > > >
> > > > In the above scenario, when cpu0 is in the cgroup isolated
> > > > partition,
> > > > its sched domain is deteched, but it is still available in both of
> > > > the
> > > > above cpumasks. As a result, cpu0
> > >
> > > I saw in nohz_balance_enter_idle(), if a cpu is isolated, it will not
> > > set itself in nohz.idle_cpus_mask and thus should not be chosen as
> > > ilb_cpu. I wonder what's stopping this from working?
> >
> > One thing I forgot to mention is that the problem is gone if we offline
> > and re-online those cpus. In that case, the isolated cpus are removed
> > from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are never
> > added back.
> >
> > At runtime, the cpus can be removed from the nohz.idle_cpus_mask in
> > below case only
> > trigger_load_balance()
> > if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
> > return;
> > nohz_balancer_kick(rq);
> > nohz_balance_exit_idle()
> >
> > My understanding is that if a cpu is in nohz.idle_cpus_mask when it is
> > isolated, there is no chance to remove it from that mask later, so the
> > check in nohz_balance_enter_idle() does not help.
>
>
> As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped is
> never updated. Instead of avoiding isolated CPUs at each find_new_ilb() call
> as this patch does, maybe it would be better to permanently remove these CPUs
> from nohz.idle_cpus_mask. This would lower the number of checks done.

I agree.

> This could be done with something like:
> @@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
> */
> rq->has_blocked_load = 1;
> + /* If we're a completely isolated CPU, we don't play: */
> + if (on_null_domain(rq)) {
> + if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask)) {
> + cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
> + atomic_dec(&nohz.nr_cpus);
> + }
> +


> + /*
> + * Set nohz_tick_stopped on isolated CPUs to avoid keeping the
> + * value that was stored when rebuild_sched_domains() was called.
> + */
> + rq->nohz_tick_stopped = 1;

It's not immediately clear to me what the above comment and code mean,
maybe that's because I know very little about sched domain related code.
Other than this, your change looks good to me.

Thanks,
Aaron

> + }
> +
> /*
> * The tick is still stopped but load could have been added in the
> * meantime. We set the nohz.has_blocked flag to trig a check of the
> @@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
> if (rq->nohz_tick_stopped)
> goto out;
> - /* If we're a completely isolated CPU, we don't play: */
> - if (on_null_domain(rq))
> - return;
> -
> rq->nohz_tick_stopped = 1;
> cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>
> Otherwise I could reproduce the issue and the patch was solving it, so:
> Tested-by: Pierre Gondois <pierre.gondois@xxxxxxx>
>
>
>
>
>
> Also, your patch doesn't aim to solve that, but I think there is an issue
> when updating cpuset.cpus when an isolated partition was already created:
>
> // Create an isolated partition containing CPU0
> # mkdir cgroup
> # mount -t cgroup2 none cgroup/
> # mkdir cgroup/Testing
> # echo "+cpuset" > cgroup/cgroup.subtree_control
> # echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
> # echo 0 > cgroup/Testing/cpuset.cpus
> # echo isolated > cgroup/Testing/cpuset.cpus.partition
>
> // CPU0's sched domain is detached:
> # ls /sys/kernel/debug/sched/domains/cpu0/
> # ls /sys/kernel/debug/sched/domains/cpu1/
> domain0 domain1
>
> // Change the isolated partition to be CPU1
> # echo 1 > cgroup/Testing/cpuset.cpus
>
> // CPU[0-1] sched domains are not updated:
> # ls /sys/kernel/debug/sched/domains/cpu0/
> # ls /sys/kernel/debug/sched/domains/cpu1/
> domain0 domain1
>
> Regards,
> Pierre
>