Re: [PATCH v2] sched/fair: use rq_clock() in update_tg_load_avg() rate-limit
From: Vincent Guittot
Date: Wed May 27 2026 - 10:34:38 EST
On Wed, 27 May 2026 at 15:59, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> update_tg_load_avg() is called once per leaf cfs_rq from the
> __update_blocked_fair() walk that runs inside the NOHZ idle-balance
> softirq, and again from update_load_avg() with UPDATE_TG. Its first
> operation after the trivial early-outs is unconditionally:
>
> now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> return;
>
> Jakub ran into a system where nohz_idle_balance() was taking 75%
> of a CPU (which is handling network traffic and doing many irq_exit_cpu
> calls), with 35% of that CPU spent in update_load_avg, and 17% of the
> CPU in sched_clock_cpu(), reading the TSC.
>
> In a quick synthetic test, it looks like this patch reduces the
> CPU use of sched_balance_update_blocked_averages by about 20%.
>
> Switch the rate-limit to read rq_clock(rq_of(cfs_rq)) instead.
> This eliminates the rdtsc, and uses a fairly fresh timestamp,
> because all callers of update_tg_load_avg() and clear_tg_load_avg()
> hold rq->lock and have called update_rq_clock(rq) within microseconds:
>
> caller pre-state
> __update_blocked_fair encloser did update_rq_clock(rq)
> update_load_avg's three UPDATE_TG sites under rq->lock after enqueue/dequeue/update_curr
> attach_/detach_entity_cfs_rq preceded by update_load_avg(...)
> clear_tg_load_avg via offline path rq_clock_start_loop_update(rq) upfront
>
> so rq->clock is fresh at every call. Since cfs_rqs are per-CPU
> per-task_group, cfs_rq->last_update_tg_load_avg is always compared
> against the same rq's clock; no cross-rq drift.
>
> The same hoisting pattern was recently applied to find_new_ilb() in
> commit 76504bce4ee6 ("sched/fair: Get this cpu once in find_new_ilb()").
Please remove this 2 lines above which seem to be out of scope. If you
really want to refer to a commit, it would be better to mention the
one that introduced update_rq_clock()
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> Assisted-by: Claude (Anthropic)
Besides the minor comment above
Reviewed-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ebec186f982..37b39bdf5ef9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4393,7 +4393,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> * For migration heavy workloads, access to tg->load_avg can be
> * unbound. Limit the update rate to at most once per ms.
> */
> - now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> + now = rq_clock(rq_of(cfs_rq));
> if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
> return;
>
> @@ -4416,7 +4416,7 @@ static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
> if (cfs_rq->tg == &root_task_group)
> return;
>
> - now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> + now = rq_clock(rq_of(cfs_rq));
> delta = 0 - cfs_rq->tg_load_avg_contrib;
> atomic_long_add(delta, &cfs_rq->tg->load_avg);
> cfs_rq->tg_load_avg_contrib = 0;
> --
> 2.54.0
>
>