Re: [PATCH] sched/fair: Skip cpus with no sched domain attached during NOHZ idle balance

From: Pierre Gondois
Date: Fri Sep 08 2023 - 05:44:24 EST


Hello Rui,

On 8/14/23 10:30, Zhang, Rui wrote:
On Mon, 2023-08-14 at 11:14 +0800, Aaron Lu wrote:
Hi Rui,

On Fri, Aug 04, 2023 at 05:08:58PM +0800, Zhang Rui wrote:
Problem statement
-----------------
When using a cgroup isolated partition to isolate cpus including cpu0,
it is observed that cpu0 is woken up frequently but does nothing. This
is not good for power efficiency.

<idle>-0     [000]   616.491602: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
<idle>-0     [000]   616.491608: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=615996000000 softexpires=615996000000
<idle>-0     [000]   616.491616: rcu_utilization:      Start context switch
<idle>-0     [000]   616.491618: rcu_utilization:      End context switch
<idle>-0     [000]   616.491637: tick_stop:            success=1 dependency=NONE
<idle>-0     [000]   616.491637: hrtimer_cancel:       hrtimer=0xffff8e8fdf623c10
<idle>-0     [000]   616.491638: hrtimer_start:        hrtimer=0xffff8e8fdf623c10 function=tick_sched_timer/0x0 expires=616420000000 softexpires=616420000000

The above pattern repeats every one or a few ticks, resulting in a
total of 2000+ wakeups on cpu0 in 60 seconds when running a workload
on the cpus that are not in the isolated partition.

Rootcause
---------
In NOHZ mode, an active cpu either sends an IPI or touches the idle
cpu's polling flag to wake it up, so that the idle cpu can pull tasks
from the busy cpu. The logic for selecting the target cpu is to use the
first idle cpu that is present in both nohz.idle_cpus_mask and the
housekeeping_cpumask.

In the above scenario, when cpu0 is in the cgroup isolated partition,
its sched domain is detached, but it is still present in both of the
above cpumasks. As a result, cpu0 keeps being selected as the idle load
balance target and woken up, even though it has no sched domain and
nothing to balance.
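For reference, the target selection above roughly corresponds to
find_new_ilb() in kernel/sched/fair.c. The sketch below is a simplified
paraphrase, not the exact upstream code; details such as the
housekeeping type (HK_TYPE_MISC here) may differ between kernel
versions:

	static inline int find_new_ilb(void)
	{
		int ilb;

		/* Walk the idle cpus that are also housekeeping cpus. */
		for_each_cpu_and(ilb, nohz.idle_cpus_mask,
				 housekeeping_cpumask(HK_TYPE_MISC)) {
			if (ilb == smp_processor_id())
				continue;

			/* An isolated cpu such as cpu0 can still be returned here. */
			if (idle_cpu(ilb))
				return ilb;
		}

		return nr_cpu_ids;
	}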

I saw in nohz_balance_enter_idle() that if a cpu is isolated, it will
not set itself in nohz.idle_cpus_mask and thus should not be chosen as
ilb_cpu. I wonder what's stopping this from working?

One thing I forgot to mention is that the problem is gone if we offline
and re-online those cpus. In that case, the isolated cpus are removed
from the nohz.idle_cpus_mask in sched_cpu_deactivate() and are never
added back.
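
For reference, the offline path clears the cpu from the mask in
sched_cpu_deactivate(); a paraphrased sketch (not the exact upstream
code) of the relevant part:

	int sched_cpu_deactivate(unsigned int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/* ... */
		nohz_balance_exit_idle(rq);	/* drops the cpu from nohz.idle_cpus_mask */
		/* ... */
		return 0;
	}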

At runtime, a cpu can be removed from nohz.idle_cpus_mask only via the
path below:

trigger_load_balance()
    if (unlikely(on_null_domain(rq) || !cpu_active(cpu_of(rq))))
        return;
    nohz_balancer_kick(rq)
        nohz_balance_exit_idle()

My understanding is that if a cpu is already in nohz.idle_cpus_mask
when it gets isolated, trigger_load_balance() always returns early on
it because of the on_null_domain() check, so there is no chance to
remove it from that mask later, and the check in
nohz_balance_enter_idle() does not help.
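
For context, the relevant part of nohz_balance_enter_idle() (abridged
here, not the exact upstream code) only bails out before adding a cpu
to the mask; it never clears a cpu that is already set:

	void nohz_balance_enter_idle(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/* ... */
		rq->has_blocked_load = 1;

		if (rq->nohz_tick_stopped)
			goto out;

		/* If we're a completely isolated CPU, we don't play: */
		if (on_null_domain(rq))
			return;	/* never added, but never removed either if already set */

		rq->nohz_tick_stopped = 1;

		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
		atomic_inc(&nohz.nr_cpus);
		/* ... */
	out:
		/* ... */
	}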


As you mentioned, once a CPU is isolated, its rq->nohz_tick_stopped is
never updated. Instead of avoiding isolated CPUs at each find_new_ilb() call
as this patch does, maybe it would be better to permanently remove these CPUs
from nohz.idle_cpus_mask. This would lower the number of checks done.
This could be done with something like:
@@ -11576,6 +11586,20 @@ void nohz_balance_enter_idle(int cpu)
 	 */
 	rq->has_blocked_load = 1;
 
+	/* If we're a completely isolated CPU, we don't play: */
+	if (on_null_domain(rq)) {
+		if (cpumask_test_cpu(rq->cpu, nohz.idle_cpus_mask)) {
+			cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
+			atomic_dec(&nohz.nr_cpus);
+		}
+
+		/*
+		 * Set nohz_tick_stopped on isolated CPUs to avoid keeping the
+		 * value that was stored when rebuild_sched_domains() was called.
+		 */
+		rq->nohz_tick_stopped = 1;
+	}
+
 	/*
 	 * The tick is still stopped but load could have been added in the
 	 * meantime. We set the nohz.has_blocked flag to trig a check of the
@@ -11585,10 +11609,6 @@ void nohz_balance_enter_idle(int cpu)
 	if (rq->nohz_tick_stopped)
 		goto out;
 
-	/* If we're a completely isolated CPU, we don't play: */
-	if (on_null_domain(rq))
-		return;
-
 	rq->nohz_tick_stopped = 1;
 
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);

Otherwise, I could reproduce the issue and the patch solved it, so:
Tested-by: Pierre Gondois <pierre.gondois@xxxxxxx>


Also, your patch doesn't aim to solve that, but I think there is an
issue when updating cpuset.cpus after an isolated partition has already
been created:

// Create an isolated partition containing CPU0
# mkdir cgroup
# mount -t cgroup2 none cgroup/
# mkdir cgroup/Testing
# echo "+cpuset" > cgroup/cgroup.subtree_control
# echo "+cpuset" > cgroup/Testing/cgroup.subtree_control
# echo 0 > cgroup/Testing/cpuset.cpus
# echo isolated > cgroup/Testing/cpuset.cpus.partition

// CPU0's sched domain is detached:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0 domain1

// Change the isolated partition to be CPU1
# echo 1 > cgroup/Testing/cpuset.cpus

// CPU[0-1] sched domains are not updated:
# ls /sys/kernel/debug/sched/domains/cpu0/
# ls /sys/kernel/debug/sched/domains/cpu1/
domain0 domain1

Regards,
Pierre