Re: [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue

From: Vincent Guittot

Date: Fri Apr 03 2026 - 04:38:33 EST


On Thu, 2 Apr 2026 at 21:27, Shubhang Kaushik
<shubhang@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi Vincent,
>
> I have been testing your v2 patch on my 80-core Ampere Altra (ARMv8
> Neoverse-N1) 1P system with an idle tickless kernel on the latest
> tip/sched/core branch.
>
> On Tue, 31 Mar 2026, Vincent Guittot wrote:
>
> > The delayed dequeue feature aims to reduce the negative lag of a dequeued
> > task while it sleeps, but it can happen that newly enqueued tasks move
> > the avg vruntime backward and increase its negative lag.
> > When the delayed dequeued task wakes up, it has more negative lag than if
> > it had been dequeued immediately, or than other tasks that were dequeued
> > just before these new enqueues.
> >
> > Ensure that the negative lag of a delayed dequeued task doesn't increase
> > during its delayed dequeue phase while waiting for its negative lag to
> > disappear. Similarly, remove any positive lag that the delayed dequeued
> > task could have gained during this period.
> >
> > Short-slice tasks are particularly impacted on overloaded systems.
> >
> > Test on snapdragon rb5:
> >
> > hackbench -T -p -l 16000000 -g 2 1> /dev/null &
> > cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock -h 20000 -q
> >
> > The scheduling latency of cyclictest is:
> >
> >                           tip/sched/core   tip/sched/core   +this patch
> > cyclictest slice (ms)   |  (default) 2.8 |       8        |      8
> > hackbench slice (ms)    |  (default) 2.8 |      20        |     20
> > Total Samples           |        115632  |  119733        | 119806
> > Average (us)            |           364  |      64 (-82%) |     61 (- 5%)
> > Median (P50) (us)       |            60  |      56 (- 7%) |     56 (  0%)
> > 90th Percentile (us)    |          1166  |      62 (-95%) |     62 (  0%)
> > 99th Percentile (us)    |          4192  |      73 (-98%) |     72 (- 1%)
> > 99.9th Percentile (us)  |          8528  |    2707 (-68%) |   1300 (-52%)
> > Maximum (us)            |         17735  |   14273 (-20%) |  13525 (- 5%)
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > ---
> >
>
> I replicated this cyclictest environment scaled for 80 cores using a
> background hackbench load (-g 20). On Ampere Altra, I did not see the
> tail latency reduction that you observed on the 8-core Snapdragon. In
> fact, both average and max latencies increased slightly.
>
> Metric | Baseline | Patched | Delta (%)
> -------------|----------|----------|-----------
> Max Latency | 9141us | 9426us | +3.11%
> Avg Latency | 206us | 217us | +5.33%
> Min Latency | 14us | 13us | -7.14%

Without setting a shorter custom slice for cyclictest, you will not
see any major difference. The difference shows up at p99 and p99.9
with a shorter slice.

>
> More concerning is the impact on throughput. At 8-16 threads, hackbench
> execution times increased by ~30%. I attempted to isolate this by

Hmm, I ran some perf tests and I haven't seen any difference for
hackbench with various numbers of groups.

> disabling the DELAY_DEQUEUE sched_feature. But the regression persists
> even with NO_DELAY_DEQUEUE, pointing to overhead in the modified
> update_entity_lag() path itself.
>
> Test Case | Baseline | Patched | Delta (%) | Patched(NO_DELAYDQ)

By baseline, do you mean tip/sched/core or v7.0-rcx?

> -------------|----------|----------|-----------|--------------------
> 4 Threads | 13.77s | 17.53s | +27.3% | 17.16s
> 8 Threads | 24.39s | 31.90s | +30.8% | 30.67s
> 16 Threads | 47.92s | 60.46s | +26.2% | 62.53s
> 32 Processes | 118.08s | 103.16s | -12.6% | 101.87s

That's surprising. I ran some perf tests with the patch and haven't
seen any difference:

                             tip/sched/core   + patch
hackbench 1 process socket   0,581            0,580  (0,0 %)
  stddev                     2,7 %            2,5 %
hackbench 4 process socket   0,612            0,612  (0,0 %)
  stddev                     0,9 %            2,3 %
hackbench 8 process socket   0,662            0,659  (0,4 %)
  stddev                     1,0 %            1,8 %
hackbench 16 process socket  0,700            0,699  (0,3 %)
  stddev                     1,6 %            1,3 %
hackbench 1 process pipe     0,796            0,797  (-0,2 %)
  stddev                     1,5 %            1,9 %
hackbench 4 process pipe     0,699            0,694  (0,8 %)
  stddev                     3,7 %            2,5 %
hackbench 8 process pipe     0,631            0,636  (-0,9 %)
  stddev                     3,4 %            2,2 %
hackbench 16 process pipe    0,612            0,594  (2,9 %)
  stddev                     1,8 %            1,5 %
hackbench 1 thread socket    0,571            0,570  (0,1 %)
  stddev                     2,3 %            1,5 %
hackbench 4 thread socket    0,591            0,594  (-0,5 %)
  stddev                     1,2 %            0,7 %
hackbench 8 thread socket    0,621            0,628  (-1,2 %)
  stddev                     1,3 %            1,4 %
hackbench 16 thread socket   0,660            0,653  (1,0 %)
  stddev                     0,7 %            0,9 %
hackbench 1 thread pipe      0,860            0,864  (-0,6 %)
  stddev                     1,4 %            2,0 %
hackbench 4 thread pipe      0,828            0,821  (0,9 %)
  stddev                     3,5 %            4,7 %
hackbench 8 thread pipe      0,725            0,739  (-1,8 %)
  stddev                     2,3 %            8,6 %
hackbench 16 thread pipe     0,647            0,645  (0,4 %)
  stddev                     4,3 %            4,2 %

>
> > Since v1:
> > - Embedded the check of lag evolution of delayed dequeue entities in
> > update_entity_lag() to include all cases.
> >
>
> While the patch shows a ~12.6% improvement at high saturation (32
> processes), the throughput cost at mid-range scales appears to outweigh
> the fairness benefits on our high-core-count system, as even the
> worst-case wake-up latencies did not improve.
>
> > kernel/sched/fair.c | 53 ++++++++++++++++++++++++++-------------------
> > 1 file changed, 31 insertions(+), 22 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 226509231e67..c1ffe86bf78d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -840,11 +840,30 @@ static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avrunt
> >  	return clamp(vlag, -limit, limit);
> >  }
> >
> > -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > +/*
> > + * Delayed dequeue aims to reduce the negative lag of a dequeued task.
> > + * While updating the lag of an entity, check that negative lag didn't increase
> > + * during the delayed dequeue period which would be unfair.
> > + * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
> > + * set.
> > + *
> > + * Return true if the lag has been adjusted.
> > + */
> > +static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > +	s64 vlag;
> > +
> >  	WARN_ON_ONCE(!se->on_rq);
> >
> > -	se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> > +	vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> > +
> > +	if (se->sched_delayed)
> > +		/* previous vlag < 0 otherwise se would not be delayed */
> > +		se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
> > +	else
> > +		se->vlag = vlag;
> > +
> > +	return (vlag != se->vlag);
> >  }
> >
> >  /*
> > @@ -5563,13 +5582,6 @@ static void clear_delayed(struct sched_entity *se)
> >  	}
> >  }
> >
> > -static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
> > -{
> > -	clear_delayed(se);
> > -	if (sched_feat(DELAY_ZERO) && se->vlag > 0)
> > -		se->vlag = 0;
> > -}
> > -
> >  static bool
> >  dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> > @@ -5595,6 +5607,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  	if (sched_feat(DELAY_DEQUEUE) && delay &&
> >  	    !entity_eligible(cfs_rq, se)) {
> >  		update_load_avg(cfs_rq, se, 0);
> > +		update_entity_lag(cfs_rq, se);
>
> The regression persists even with NO_DELAY_DEQUEUE, likely because
> update_entity_lag() is now called unconditionally in dequeue_entity(),
> thereby adding avg_vruntime() overhead and cacheline contention for
> every dequeue.
>
> Do consider guarding the update_entity_lag() call in dequeue_entity()
> with sched_feat(DELAY_DEQUEUE) check to avoid this tax when the feature
> is disabled.
>
> >  		set_delayed(se);
> >  		return false;
> >  	}
> > @@ -5634,7 +5647,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  	update_cfs_group(se);
> >
> >  	if (flags & DEQUEUE_DELAYED)
> > -		finish_delayed_dequeue_entity(se);
> > +		clear_delayed(se);
> >
> >  	if (cfs_rq->nr_queued == 0) {
> >  		update_idle_cfs_rq_clock_pelt(cfs_rq);
> > @@ -7088,18 +7101,14 @@ requeue_delayed_entity(struct sched_entity *se)
> >  	WARN_ON_ONCE(!se->sched_delayed);
> >  	WARN_ON_ONCE(!se->on_rq);
> >
> > -	if (sched_feat(DELAY_ZERO)) {
> > -		update_entity_lag(cfs_rq, se);
> > -		if (se->vlag > 0) {
> > -			cfs_rq->nr_queued--;
> > -			if (se != cfs_rq->curr)
> > -				__dequeue_entity(cfs_rq, se);
> > -			se->vlag = 0;
> > -			place_entity(cfs_rq, se, 0);
> > -			if (se != cfs_rq->curr)
> > -				__enqueue_entity(cfs_rq, se);
> > -			cfs_rq->nr_queued++;
> > -		}
> > +	if (update_entity_lag(cfs_rq, se)) {
> > +		cfs_rq->nr_queued--;
> > +		if (se != cfs_rq->curr)
> > +			__dequeue_entity(cfs_rq, se);
> > +		place_entity(cfs_rq, se, 0);
> > +		if (se != cfs_rq->curr)
> > +			__enqueue_entity(cfs_rq, se);
> > +		cfs_rq->nr_queued++;
>
> Triggering a full dequeue/enqueue cycle for every vlag adjustment appears
> to be a major bottleneck. Frequent RB-tree rebalancing here creates
> significant contention.

This adjustment is not supposed to happen

>
> Could we preserve fairness while recovering throughput by only re-queuing
> when the lag sign changes or a significant eligibility threshold is
> crossed?

Could you monitor how often we have to adjust the lag in your case? As
mentioned above, this shouldn't happen often, in particular for the
negative lag increase case.

>
> >  	}
> >
> >  	update_load_avg(cfs_rq, se, 0);
> > --
> > 2.43.0
> >
> >
> Regards,
> Shubhang Kaushik