Re: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags

From: Chen Yu
Date: Sun Jun 02 2024 - 12:40:45 EST


On 2024-05-31 at 13:54:52 -0700, Tim Chen wrote:
> We observed that the overhead of trigger_load_balance(), now renamed
> sched_balance_trigger(), has risen with a system's core count.
>
> For an OLTP workload running a 6.8 kernel on a 2-socket x86 system
> with 96 cores/socket, we saw that 0.7% of cpu cycles are spent in
> trigger_load_balance(). On older systems with fewer cores/socket, this
> function's overhead was less than 0.1%.
>
> The cause of this overhead was that there are multiple cpus calling
> kick_ilb(flags), updating the balancing work needed to a common idle
> load balancer cpu. The ilb_cpu's flags field got updated unconditionally
> with atomic_fetch_or(). The atomic reads and writes to ilb_cpu's flags
> cause much cache bouncing and cpu cycle overhead. This is seen in the
> annotated profile below.
>
> kick_ilb():
> if (ilb_cpu < 0)
> test %r14d,%r14d
> ↑ js 6c
> flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
> mov $0x2d600,%rdi
> movslq %r14d,%r8
> mov %rdi,%rdx
> add -0x7dd0c3e0(,%r8,8),%rdx
> arch_atomic_read():
> 0.01 mov 0x64(%rdx),%esi
> 35.58 add $0x64,%rdx
> arch_atomic_fetch_or():
>
> static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
> {
> int val = arch_atomic_read(v);
>
> do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
> 0.03 157: mov %r12d,%ecx
> arch_atomic_try_cmpxchg():
> return arch_try_cmpxchg(&v->counter, old, new);
> 0.00 mov %esi,%eax
> arch_atomic_fetch_or():
> do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
> or %esi,%ecx
> arch_atomic_try_cmpxchg():
> return arch_try_cmpxchg(&v->counter, old, new);
> 0.01 lock cmpxchg %ecx,(%rdx)
> 42.96 ↓ jne 2d2
> kick_ilb():
>
> With instrumentation, we found that 81% of the updates do not result in
> any change in the ilb_cpu's flags. That is, multiple cpus are asking
> the ilb_cpu to do the same things over and over again, before the ilb_cpu
> has a chance to run NOHZ load balance.
>
> Skip updates to ilb_cpu's flags if no new work needs to be done.
> Such updates do not change ilb_cpu's NOHZ flags. This requires an extra
> atomic read but it is less expensive than frequent unnecessary atomic
> updates that generate cache bounces.

The underlying problem is that many CPUs choose the same ilb_cpu and ask it to
trigger the nohz idle balance, because find_new_ilb() always picks the first
nohz idle CPU. I wonder if we could change the
	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask)
into something like
	for_each_cpu_wrap(ilb_cpu, cpumask_and(nohz.idle_cpus_mask, hk_mask), this_cpu+1)
so that different CPUs may find different ilb_cpus. The extra atomic read
would then cause fewer cache bounces.
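
For illustration only, here is a userspace sketch of the two scan orders. It
uses plain bitmasks instead of cpumasks, and pick_ilb_first(),
pick_ilb_wrapped() and NR_CPUS_SKETCH are made-up names, not kernel symbols.
It also assumes the kicking CPU never picks itself.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 8	/* hypothetical CPU count for the sketch */

/* First-fit scan: every caller starts at CPU 0, like the current loop. */
static int pick_ilb_first(unsigned int idle_mask, unsigned int hk_mask,
			  int this_cpu)
{
	unsigned int candidates = idle_mask & hk_mask;
	int cpu;

	assert(this_cpu >= 0 && this_cpu < NR_CPUS_SKETCH);

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (cpu == this_cpu)
			continue;	/* assumption: never pick ourselves */
		if (candidates & (1u << cpu))
			return cpu;
	}
	return -1;
}

/* Wrapped scan: start at this_cpu + 1 and wrap around past the last CPU. */
static int pick_ilb_wrapped(unsigned int idle_mask, unsigned int hk_mask,
			    int this_cpu)
{
	unsigned int candidates = idle_mask & hk_mask;
	int i;

	assert(this_cpu >= 0 && this_cpu < NR_CPUS_SKETCH);

	/* i stops before NR_CPUS_SKETCH, so this_cpu itself is never visited. */
	for (i = 1; i < NR_CPUS_SKETCH; i++) {
		int cpu = (this_cpu + i) % NR_CPUS_SKETCH;

		if (candidates & (1u << cpu))
			return cpu;
	}
	return -1;
}
```

With idle CPUs 1, 3, 5 and 7 and kickers 0, 2, 4 and 6, the first-fit scan
sends every kicker to CPU 1, while the wrapped scan gives 1, 3, 5 and 7
respectively. If the idle CPUs all sit after the busy ones (say idle 4-7,
kickers 0-3), the wrapped scan still converges on CPU 4, so it only spreads
the kicks when idle and busy CPUs are interleaved.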

>
> We saw that on the OLTP workload, cpu cycles from trigger_load_balance()
> (or sched_balance_trigger()) got reduced from 0.7% to 0.2%.
>
> Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..9ab6dff6d8ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11891,6 +11891,13 @@ static void kick_ilb(unsigned int flags)
> if (ilb_cpu < 0)
> return;
>
> + /*
> + * Don't bother if no new NOHZ balance work items for ilb_cpu,
> + * i.e. all bits in flags are already set in ilb_cpu.
> + */
> + if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)

Maybe also mention in the comment that when the above condition is true,
ilb_cpu's flags are already non-zero and contain bits in NOHZ_KICK_MASK, so
returning directly here is safe (just my 2 cents).
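
To spell out that reasoning, here is a minimal userspace sketch using C11
atomics. request_work() and the KICK_* values are hypothetical stand-ins for
kick_ilb() and the NOHZ_* flags, and it assumes callers always pass a non-zero
subset of the kick mask.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative values only; the kernel's NOHZ_* flags are defined differently. */
#define KICK_BALANCE	0x1u
#define KICK_STATS	0x2u
#define KICK_MASK	(KICK_BALANCE | KICK_STATS)

/* Returns true if the caller set the first flag and must send the kick. */
static bool request_work(atomic_uint *ilb_flags, unsigned int flags)
{
	unsigned int old;

	/* Assumption: flags is non-zero and lies within KICK_MASK. */
	assert(flags && !(flags & ~KICK_MASK));

	/*
	 * Read-only check: every requested bit is already set. Since flags
	 * is non-zero, some KICK_MASK bit is set, so an earlier caller owns
	 * the kick and the fetch_or below would also have returned false.
	 */
	if ((atomic_load(ilb_flags) & flags) == flags)
		return false;

	old = atomic_fetch_or(ilb_flags, flags);

	/* Whoever finds no KICK_MASK bit set beforehand owns the kick. */
	return !(old & KICK_MASK);
}
```

The early return yields the same result as the unconditional fetch_or path
would, it just avoids the write to the shared cache line.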

Reviewed-by: Chen Yu <yu.c.chen@xxxxxxxxx>

thanks,
Chenyu

> + return;
> +
> /*
> * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
> * the first flag owns it; cleared by nohz_csd_func().
> --
> 2.32.0
>