Re: [PATCH] sched/balance: Skip unnecessary updates to idle load balancer's flags

From: Vincent Guittot
Date: Tue Jun 04 2024 - 10:39:03 EST


On Fri, 31 May 2024 at 22:52, Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
>
> We observed that the overhead of trigger_load_balance(), now renamed
> sched_balance_trigger(), rises with a system's core count.
>
> For an OLTP workload running a 6.8 kernel on a 2-socket x86 system
> with 96 cores/socket, we saw that 0.7% of cpu cycles were spent in
> trigger_load_balance(). On older systems with fewer cores/socket, this
> function's overhead was less than 0.1%.
>
> The cause of this overhead is that multiple cpus call kick_ilb(flags),
> updating the balancing work needed on a common idle load balancer cpu.
> The ilb_cpu's flags field gets updated unconditionally with
> atomic_fetch_or(). The atomic reads and writes to ilb_cpu's flags cause
> heavy cache bouncing and cpu cycle overhead, as seen in the annotated
> profile below.
>
> kick_ilb():
> if (ilb_cpu < 0)
> test %r14d,%r14d
> ↑ js 6c
> flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
> mov $0x2d600,%rdi
> movslq %r14d,%r8
> mov %rdi,%rdx
> add -0x7dd0c3e0(,%r8,8),%rdx
> arch_atomic_read():
> 0.01 mov 0x64(%rdx),%esi
> 35.58 add $0x64,%rdx
> arch_atomic_fetch_or():
>
> static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v)
> {
> int val = arch_atomic_read(v);
>
> do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
> 0.03 157: mov %r12d,%ecx
> arch_atomic_try_cmpxchg():
> return arch_try_cmpxchg(&v->counter, old, new);
> 0.00 mov %esi,%eax
> arch_atomic_fetch_or():
> do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i));
> or %esi,%ecx
> arch_atomic_try_cmpxchg():
> return arch_try_cmpxchg(&v->counter, old, new);
> 0.01 lock cmpxchg %ecx,(%rdx)
> 42.96 ↓ jne 2d2
> kick_ilb():
>
> With instrumentation, we found that 81% of the updates do not result in
> any change to the ilb_cpu's flags. That is, multiple cpus keep asking
> the ilb_cpu to do the same work over and over again, before the ilb_cpu
> has had a chance to run the NOHZ load balance.
>
> Skip the update of ilb_cpu's flags when no new work needs to be done,
> i.e. when the update would not change ilb_cpu's NOHZ flags. This
> requires an extra atomic read, but that is less expensive than the
> frequent unnecessary atomic updates that generate cache bounces.
>
> With this change, the cpu cycles spent in trigger_load_balance()
> (now sched_balance_trigger()) on the OLTP workload dropped from 0.7%
> to 0.2%.

Makes sense; we have seen other variables become a bottleneck in the
scheduler, like task_group's load_avg or the root domain's overload.
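
As a side note, the pattern generalizes beyond this spot: doing a plain
atomic read first and only falling back to the read-modify-write when the
value would actually change keeps the cache line shared on the common path.
A minimal standalone C11 sketch of the idea (the function and variable
names here are illustrative, not the kernel's):

    #include <stdatomic.h>
    #include <stdbool.h>

    /*
     * Illustrative only: request work bits on a shared flag word, but
     * skip the atomic RMW when all requested bits are already set, so
     * concurrent callers asking for the same work do not keep pulling
     * the cache line exclusive on every call.
     */
    static bool request_work(atomic_int *shared_flags, int work_bits)
    {
            /* Cheap shared read: no exclusive cache-line ownership. */
            if ((atomic_load_explicit(shared_flags, memory_order_relaxed) &
                 work_bits) == work_bits)
                    return false;   /* nothing new to request */

            /* Only now pay for the locked read-modify-write. */
            atomic_fetch_or(shared_flags, work_bits);
            return true;
    }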

Reviewed-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>

>
> Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a5b1ae0aa55..9ab6dff6d8ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11891,6 +11891,13 @@ static void kick_ilb(unsigned int flags)
> if (ilb_cpu < 0)
> return;
>
> + /*
> + * Don't bother if no new NOHZ balance work items for ilb_cpu,
> + * i.e. all bits in flags are already set in ilb_cpu.
> + */
> + if ((atomic_read(nohz_flags(ilb_cpu)) & flags) == flags)
> + return;
> +
> /*
> * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
> * the first flag owns it; cleared by nohz_csd_func().
> --
> 2.32.0
>