Re: [PATCH] sched/fair: Call update_util_est() after dequeue_entities()

From: Vincent Guittot

Date: Fri May 15 2026 - 11:23:50 EST

On Fri, 15 May 2026 at 03:51, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
>
> On 05/13/26 14:46, Vincent Guittot wrote:
> > Le mardi 12 mai 2026 à 13:52:09 (-0700), John Stultz a écrit :
> > > On Tue, May 12, 2026 at 5:47 AM Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> > > >
> > > > update_util_est() reads task_util() at dequeue which is updated in
> > > > dequeue_entities(). To read the accurate util_avg at dequeue, make sure
> > > > to do the read after load_avg is updated in dequeue_entities().
> > > >
> > > > util_est for a periodic task before
> > > >
> > > > periodic-3114 util_est.enqueued running
> > > > ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> > > > 183┤ ▖▗ ▐▖ ▖ ▗▙ ▗ ▗▙▖▖ ▖▖ ▖ ▖▖ ▗ ▟ ▗▄▖ │
> > > > 139┤ ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖ │
> > > > 95┤ ▐▀ ▘ ▝ ▝ ▝▘ ▘ ▘▘ ▝▘ ▝▘ ▝ ▝ ▀ │
> > > > │ ▛ │
> > > > 51┤ ▐▘ │
> > > > 7┤ ▖▗▗ ▗▄▐ │
> > > > └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> > > > 0.00 0.65 1.30 1.96 2.61 3.26 3.91 4.57 5.22 5.87
> > > >
> > > > and after
> > > >
> > > > periodic-2977 util_est.enqueued running
> > > > ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> > > > 157.0┤ ▙▄ ▗▄ ▗▄▄▄ ▗▄ ▗▄▄▄▗▄▄ ▗▄▄▖ ▄ ▄▄▄ ▄ ▄▖▖ ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄ │
> > > > 119.5┤ ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀ ▝▝▀ ▀ ▀▀▀▀ │
> > > > 82.0┤ ▟ │
> > > > │ ▌ │
> > > > 44.5┤ ▌ │
> > > > 7.0┤ ▗ ▗▖ ▌ │
> > > > └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> > > > 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.56 5.21 5.86
> > > >
> > > > Note how the signal is noisier and can peak to 183 vs 157 now.
> > > >
> > > > Fixes: b55945c500c5 ("sched: Fix pick_next_task_fair() vs try_to_wake_up() race")
> > > > Signed-off-by: Qais Yousef <qyousef@xxxxxxxxxxx>
> > > > ---
> > > >
> > > > This is split from [1] series where I stumbled upon this problem. AFAICS it
> > > > needs backporting all the way to 6.12 LTS.
> > > >
> > > > [1] https://lore.kernel.org/lkml/20260504020003.71306-1-qyousef@xxxxxxxxxxx/
> > > >
> > > > kernel/sched/fair.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 728965851842..96ba97e5f4ae 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7401,6 +7401,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > > > */
> > > > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > {
> > > > + int ret;
> > > > +
> > > > if (task_is_throttled(p)) {
> > > > dequeue_throttled_task(p, flags);
> > > > return true;
> > > > @@ -7409,8 +7411,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > if (!p->se.sched_delayed)
> > > > util_est_dequeue(&rq->cfs, p);
> > > >
> > > > + ret = dequeue_entities(rq, &p->se, flags);
> > > > util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > > > - if (dequeue_entities(rq, &p->se, flags) < 0)
> > >
> > > Hrm, so the Sashiko tool raised a reasonable concern on the earlier
> > > version of this:
> > > https://sashiko.dev/#/patchset/20260504020003.71306-1-qyousef%40layalina.io?part=12
> >
> > Even without sashiko, this is what the comment says just below in the function ;-)
>
> Yeah I read it but it didn't really register properly :-)
>
> >
> > Below is a different way to fix it which also covers cases when we have both
> > DEQUEUE_SLEEP and DEQUEUE_DELAYED
>
> Works for me. It looks even more stable now
>
> util_avg
>
> periodic-2737 util_avg running
> ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
> 140┤ ▟▟█▄▄██▙▄▟▟▙▙▟█▟▄▟▟▟▄▄██▄▟▙█▙▄▙█▙▄█▙▙▄▙▙▙█▟█▄▟▟▟▄▟█▟▟▄██▄▙█▟▟▄▟██▄ │
> 105┤ ▄██████████████████████████████████████████████████████████████████ │
> 70┤ ▗▛▘ │
> │ ▐▘ │
> 35┤ █ │
> 0┤ ▖ ▗▄▌ │
> └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
> 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85
>
> and util_eset
>
> periodic-2737 util_est.enqueued running
> ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
> 140.0┤ ▞▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▜▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▀▀▀▀▀▀▀▀▜▀▀▀▀▀▀▜▛▀▀▀▛▀▀▀▀▀▀▀▀▀▘ │
> 106.8┤ ▝▘ │
> 73.5┤ ▗▘ │
> │ ▗ │
> 40.2┤ ▖ │
> 7.0┤ ▖ ▄▗▖ │
> └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
> 0.00 0.65 1.30 1.95 2.60 3.25 3.90 4.55 5.20 5.85
>
> Do you want to send a proper patch? Feel free to stick my
> reviewed-and-tested-by.

Yes I'm going to send a proper patch with reported, reviewed and tested by You

Thanks

>
> Thanks!
>
> >
> > ---
> > kernel/sched/fair.c | 189 ++++++++++++++++++++++----------------------
> > 1 file changed, 93 insertions(+), 96 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c83fbe4e88c1..0976adc12594 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5120,13 +5120,87 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > trace_pelt_cfs_tp(cfs_rq);
> > }
> >
> > +#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> > +
> > +static inline void util_est_update(struct sched_entity *se)
> > +{
> > + unsigned int ewma, dequeued, last_ewma_diff;
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return;
> > +
> > + /* Get current estimate of utilization */
> > + ewma = READ_ONCE(se->avg.util_est);
> > +
> > + /*
> > + * If the PELT values haven't changed since enqueue time,
> > + * skip the util_est update.
> > + */
> > + if (ewma & UTIL_AVG_UNCHANGED)
> > + return;
> > +
> > + /* Get utilization at dequeue */
> > + dequeued = READ_ONCE(se->avg.util_avg);
> > +
> > + /*
> > + * Reset EWMA on utilization increases, the moving average is used only
> > + * to smooth utilization decreases.
> > + */
> > + if (ewma <= dequeued) {
> > + ewma = dequeued;
> > + goto done;
> > + }
> > +
> > + /*
> > + * Skip update of task's estimated utilization when its members are
> > + * already ~1% close to its last activation value.
> > + */
> > + last_ewma_diff = ewma - dequeued;
> > + if (last_ewma_diff < UTIL_EST_MARGIN)
> > + goto done;
> > +
> > + /*
> > + * To avoid underestimate of task utilization, skip updates of EWMA if
> > + * we cannot grant that thread got all CPU time it wanted.
> > + */
> > + if ((dequeued + UTIL_EST_MARGIN) < READ_ONCE(se->avg.runnable_avg))
> > + goto done;
> > +
> > +
> > + /*
> > + * Update Task's estimated utilization
> > + *
> > + * When *p completes an activation we can consolidate another sample
> > + * of the task size. This is done by using this value to update the
> > + * Exponential Weighted Moving Average (EWMA):
> > + *
> > + * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
> > + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> > + * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> > + * = w * ( -last_ewma_diff ) + ewma(t-1)
> > + * = w * (-last_ewma_diff + ewma(t-1) / w)
> > + *
> > + * Where 'w' is the weight of new samples, which is configured to be
> > + * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> > + */
> > + ewma <<= UTIL_EST_WEIGHT_SHIFT;
> > + ewma -= last_ewma_diff;
> > + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > +done:
> > + ewma |= UTIL_AVG_UNCHANGED;
> > + WRITE_ONCE(se->avg.util_est, ewma);
> > +
> > + trace_sched_util_est_se_tp(se);
> > +}
> > +
> > /*
> > * Optional action to be done while updating the load average
> > */
> > -#define UPDATE_TG 0x1
> > -#define SKIP_AGE_LOAD 0x2
> > -#define DO_ATTACH 0x4
> > -#define DO_DETACH 0x8
> > +#define UPDATE_TG 0x01
> > +#define SKIP_AGE_LOAD 0x02
> > +#define DO_ATTACH 0x04
> > +#define DO_DETACH 0x08
> > +#define UPDATE_UTIL_EST 0x10
> >
> > /* Update task and its cfs_rq load average */
> > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > @@ -5169,6 +5243,9 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > if (flags & UPDATE_TG)
> > update_tg_load_avg(cfs_rq);
> > }
> > +
> > + if (flags & UPDATE_UTIL_EST)
> > + util_est_update(se);
> > }
> >
> > /*
> > @@ -5227,11 +5304,6 @@ static inline unsigned long task_util(struct task_struct *p)
> > return READ_ONCE(p->se.avg.util_avg);
> > }
> >
> > -static inline unsigned long task_runnable(struct task_struct *p)
> > -{
> > - return READ_ONCE(p->se.avg.runnable_avg);
> > -}
> > -
> > static inline unsigned long _task_util_est(struct task_struct *p)
> > {
> > return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
> > @@ -5274,88 +5346,6 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
> > trace_sched_util_est_cfs_tp(cfs_rq);
> > }
> >
> > -#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)
> > -
> > -static inline void util_est_update(struct cfs_rq *cfs_rq,
> > - struct task_struct *p,
> > - bool task_sleep)
> > -{
> > - unsigned int ewma, dequeued, last_ewma_diff;
> > -
> > - if (!sched_feat(UTIL_EST))
> > - return;
> > -
> > - /*
> > - * Skip update of task's estimated utilization when the task has not
> > - * yet completed an activation, e.g. being migrated.
> > - */
> > - if (!task_sleep)
> > - return;
> > -
> > - /* Get current estimate of utilization */
> > - ewma = READ_ONCE(p->se.avg.util_est);
> > -
> > - /*
> > - * If the PELT values haven't changed since enqueue time,
> > - * skip the util_est update.
> > - */
> > - if (ewma & UTIL_AVG_UNCHANGED)
> > - return;
> > -
> > - /* Get utilization at dequeue */
> > - dequeued = task_util(p);
> > -
> > - /*
> > - * Reset EWMA on utilization increases, the moving average is used only
> > - * to smooth utilization decreases.
> > - */
> > - if (ewma <= dequeued) {
> > - ewma = dequeued;
> > - goto done;
> > - }
> > -
> > - /*
> > - * Skip update of task's estimated utilization when its members are
> > - * already ~1% close to its last activation value.
> > - */
> > - last_ewma_diff = ewma - dequeued;
> > - if (last_ewma_diff < UTIL_EST_MARGIN)
> > - goto done;
> > -
> > - /*
> > - * To avoid underestimate of task utilization, skip updates of EWMA if
> > - * we cannot grant that thread got all CPU time it wanted.
> > - */
> > - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> > - goto done;
> > -
> > -
> > - /*
> > - * Update Task's estimated utilization
> > - *
> > - * When *p completes an activation we can consolidate another sample
> > - * of the task size. This is done by using this value to update the
> > - * Exponential Weighted Moving Average (EWMA):
> > - *
> > - * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
> > - * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> > - * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> > - * = w * ( -last_ewma_diff ) + ewma(t-1)
> > - * = w * (-last_ewma_diff + ewma(t-1) / w)
> > - *
> > - * Where 'w' is the weight of new samples, which is configured to be
> > - * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> > - */
> > - ewma <<= UTIL_EST_WEIGHT_SHIFT;
> > - ewma -= last_ewma_diff;
> > - ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > -done:
> > - ewma |= UTIL_AVG_UNCHANGED;
> > - WRITE_ONCE(p->se.avg.util_est, ewma);
> > -
> > - trace_sched_util_est_se_tp(&p->se);
> > -}
> > -
> > static inline unsigned long get_actual_cpu_capacity(int cpu)
> > {
> > unsigned long capacity = arch_scale_cpu_capacity(cpu);
> > @@ -5828,7 +5818,7 @@ static bool
> > dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > {
> > bool sleep = flags & DEQUEUE_SLEEP;
> > - int action = UPDATE_TG;
> > + int action = 0;
> >
> > update_curr(cfs_rq);
> > clear_buddies(cfs_rq, se);
> > @@ -5848,15 +5838,23 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >
> > if (sched_feat(DELAY_DEQUEUE) && delay &&
> > !entity_eligible(cfs_rq, se)) {
> > - update_load_avg(cfs_rq, se, 0);
> > + if (entity_is_task(se))
> > + action |= UPDATE_UTIL_EST;
> > + update_load_avg(cfs_rq, se, action);
> > update_entity_lag(cfs_rq, se);
> > set_delayed(se);
> > return false;
> > }
> > }
> >
> > - if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> > - action |= DO_DETACH;
> > + action = UPDATE_TG;
> > + if (entity_is_task(se)) {
> > + if (task_on_rq_migrating(task_of(se)))
> > + action |= DO_DETACH;
> > +
> > + if (sleep && !(flags & DEQUEUE_DELAYED))
> > + action |= UPDATE_UTIL_EST;
> > + }
> >
> > /*
> > * When dequeuing a sched_entity, we must:
> > @@ -7628,7 +7626,6 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > if (!p->se.sched_delayed)
> > util_est_dequeue(&rq->cfs, p);
> >
> > - util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
> > if (dequeue_entities(rq, &p->se, flags) < 0)
> > return false;
> >
> > --
> > 2.43.0
> >
> >
> >
> >
> > >
> > > Specifically, we shouldn't reference p after dequeue_entities() or we
> > > risk racing with it being woken up, running, and maybe exiting on
> > > another cpu.
> > > And this moves the util_est_updat() call to after dequeue finishes.
> > >
> > > Maybe there's some way to have util_est_update() compensate for the
> > > unfinished accounting that will be done in dequeue_entities()?
> > >
> > > thanks
> > > -john