Re: [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue

From: Shubhang Kaushik

Date: Sat Apr 04 2026 - 04:13:55 EST

Hi Vincent,

Thanks for the feedback. Previously, the baseline I referred to was the top of tip/sched/core. You were right about the slice tuning; the delta is much more apparent with a shorter preemption window. After setting base_slice_ns to 400us on the 80 core Ampere Altra, the results shifted significantly in favor of the patch.

The tail latency (P99.9) dropped from 4194us to 2205us (~47% reduction). While we see a slight increase in the P50 (from 62us to 88us), likely due to the additional instruction overhead in the update_entity_lag() hot-path, the overall distribution is much tighter under high contention.

The most notable impact is on system throughput. In a saturated hackbench run (32 groups/800 tasks), execution time dropped from 155.8s to 91.5s. This suggests that preventing the inflation of negative lag during delayed dequeue effectively mitigates runqueue logjams on high-core-count SMP systems. Because short-slice tasks are no longer unfairly penalized on wakeup, scheduling is noticeably smoother across the 80 cores.

System: Ampere Altra (80 Cores, 1P)
Baseline: tip/sched/core @ commit 2d4cc371baa5
Merged Patch: tip/sched/core @ commit 059258b0d424
Scheduler Tuning: base_slice_ns = 400,000 (0.4ms)

Hackbench results:
Background load: 32 groups / 800 tasks

Test Case    | Baseline (s) | Merged (s) | Improvement (run time)
-------------|--------------|------------|-----------------------
1 Thread     | 12.62        | 7.72       | +38.8%
4 Threads    | 26.85        | 16.36      | +39.1%
8 Threads    | 47.53        | 33.59      | +29.3%
16 Processes | 77.67        | 48.10      | +38.1%
32 Processes | 155.84       | 91.46      | +41.3%

Cyclictest results:
Background load: 20 groups / 800 tasks

Metric       | Baseline | Merged  | Delta (%)
-------------|----------|---------|----------
P50 (Median) | 62 us    | 88 us   | +41.9%
P99          | 1956 us  | 1319 us | -32.5%
P99.9 (Tail) | 4194 us  | 2205 us | -47.4%

Regarding the lag adjustment frequency, it seems to be an exceptional event. I monitored the logic using a kprobe on requeue_delayed_entity during the 32 process saturation test. Out of millions of scheduling events, the lag adjustment was triggered only a few times.

The patch provides an efficient guardrail that prevents EEVDF lag starvation at scale without imposing a frequent adjustment tax.

Feel free to include:-
Tested-by: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>

Regards,
Shubhang Kaushik

On Fri, 3 Apr 2026, Vincent Guittot wrote:

On Thu, 2 Apr 2026 at 21:27, Shubhang Kaushik
<shubhang@xxxxxxxxxxxxxxxxxxxxxx> wrote:

Hi Vincent,

I have been testing your v2 patch on my 80 core Ampere Altra (ARMv8
Neoverse-N1) 1P system using an idle tickless kernel on the latest
tip/sched/core branch.

On Tue, 31 Mar 2026, Vincent Guittot wrote:

The delayed dequeue feature aims to reduce the negative lag of a dequeued
task while it sleeps, but newly enqueued tasks can move the avg vruntime
backward and increase its negative lag.
When the delayed dequeued task wakes up, it has more negative lag than if
it had been dequeued immediately, or than other tasks that were dequeued
just before these new enqueues.

Ensure that the negative lag of a delayed dequeued task doesn't increase
during its delayed dequeue phase while waiting for its negative lag to
disappear. Similarly, remove any positive lag that the delayed
dequeued task could have gained during this period.

Short slice tasks are particularly impacted in an overloaded system.

Test on snapdragon rb5:

hackbench -T -p -l 16000000 -g 2 1> /dev/null &
cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock -h 20000 -q

The scheduling latency of cyclictest is:

                       | tip/sched/core | tip/sched/core | + this patch
cyclictest slice (ms)  | 2.8 (default)  | 8              | 8
hackbench slice (ms)   | 2.8 (default)  | 20             | 20
Total Samples          | 115632         | 119733         | 119806
Average (us)           | 364            | 64 (-82%)      | 61 (-5%)
Median (P50) (us)      | 60             | 56 (-7%)       | 56 (0%)
90th Percentile (us)   | 1166           | 62 (-95%)      | 62 (0%)
99th Percentile (us)   | 4192           | 73 (-98%)      | 72 (-1%)
99.9th Percentile (us) | 8528           | 2707 (-68%)    | 1300 (-52%)
Maximum (us)           | 17735          | 14273 (-20%)   | 13525 (-5%)

Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---

I replicated this cyclictest environment scaled for 80 cores using a
background hackbench load (-g 20). On Ampere Altra, I did not see the
tail latency reduction that you observed on the 8-core Snapdragon. In
fact, both average and max latencies increased slightly.

Metric | Baseline | Patched | Delta (%)
-------------|----------|----------|-----------
Max Latency | 9141us | 9426us | +3.11%
Avg Latency | 206us | 217us | +5.33%
Min Latency | 14us | 13us | -7.14%

Without setting a shorter custom slice for cyclictest, you will not
see any major differences. The difference shows up in the p99 and
p99.9 with a shorter slice.

More concerning is the impact on throughput. At 8-16 threads, hackbench
execution times increased by ~30%. I attempted to isolate this by

Hmm, I ran some perf tests and haven't seen any difference for
hackbench with various numbers of groups.

disabling the DELAY_DEQUEUE sched_feature. But the regression persists
even with NO_DELAY_DEQUEUE, pointing to overhead in the modified
update_entity_lag() path itself.

Test Case | Baseline | Patched | Delta (%) | Patched(NO_DELAYDQ)

By baseline, do you mean tip/sched/core or v7.0-rcx ?

-------------|----------|----------|-----------|--------------------
4 Threads | 13.77s | 17.53s | +27.3% | 17.16s
8 Threads | 24.39s | 31.90s | +30.8% | 30.67s
16 Threads | 47.92s | 60.46s | +26.2% | 62.53s
32 Processes | 118.08s | 103.16s | -12.6% | 101.87s

That's surprising. I ran some perf tests with the patch and haven't
seen any differences

tip/sched/core + patch
hackbench 1 process socket 0,581 0,580 (0,0 %)
stddev 2,7 % 2,5 %
hackbench 4 process socket 0,612 0,612 (0,0 %)
stddev 0,9 % 2,3 %
hackbench 8 process socket 0,662 0,659 (0,4 %)
stddev 1,0 % 1,8 %
hackbench 16 process socket 0,700 0,699 (0,3 %)
stddev 1,6 % 1,3 %
hackbench 1 process pipe 0,796 0,797 (-0,2 %)
stddev 1,5 % 1,9 %
hackbench 4 process pipe 0,699 0,694 (0,8 %)
stddev 3,7 % 2,5 %
hackbench 8 process pipe 0,631 0,636 (-0,9 %)
stddev 3,4 % 2,2 %
hackbench 16 process pipe 0,612 0,594 (2,9 %)
stddev 1,8 % 1,5 %
hackbench 1 thread socket 0,571 0,570 (0,1 %)
stddev 2,3 % 1,5 %
hackbench 4 thread socket 0,591 0,594 (-0,5 %)
stddev 1,2 % 0,7 %
hackbench 8 thread socket 0,621 0,628 (-1,2 %)
stddev 1,3 % 1,4 %
hackbench 16 thread socket 0,660 0,653 (1,0 %)
stddev 0,7 % 0,9 %
hackbench 1 thread pipe 0,860 0,864 (-0,6 %)
stddev 1,4 % 2,0 %
hackbench 4 thread pipe 0,828 0,821 (0,9 %)
stddev 3,5 % 4,7 %
hackbench 8 thread pipe 0,725 0,739 (-1,8 %)
stddev 2,3 % 8,6 %
hackbench 16 thread pipe 0,647 0,645 (0,4 %)
stddev 4,3 % 4,2 %

Since v1:
- Embedded the check of lag evolution of delayed dequeue entities in
update_entity_lag() to include all cases.

While the patch shows a ~12.6% improvement at high saturation (32
processes), the throughput cost at mid-range scales appears to outweigh
the fairness benefits on our high core system, as even the worst-case
wake-up latencies did not improve.

kernel/sched/fair.c | 53 ++++++++++++++++++++++++++-------------------
1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 226509231e67..c1ffe86bf78d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -840,11 +840,30 @@ static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avrunt
return clamp(vlag, -limit, limit);
}

-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+/*
+ * Delayed dequeue aims to reduce the negative lag of a dequeued task.
+ * While updating the lag of an entity, check that negative lag didn't increase
+ * during the delayed dequeue period which would be unfair.
+ * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
+ * set.
+ *
+ * Return true if the lag has been adjusted.
+ */
+static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ s64 vlag;
+
WARN_ON_ONCE(!se->on_rq);

- se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
+ vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
+
+ if (se->sched_delayed)
+ /* previous vlag < 0 otherwise se would not be delayed */
+ se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
+ else
+ se->vlag = vlag;
+
+ return (vlag != se->vlag);
}

/*
@@ -5563,13 +5582,6 @@ static void clear_delayed(struct sched_entity *se)
}
}

-static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
-{
- clear_delayed(se);
- if (sched_feat(DELAY_ZERO) && se->vlag > 0)
- se->vlag = 0;
-}
-
static bool
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -5595,6 +5607,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (sched_feat(DELAY_DEQUEUE) && delay &&
!entity_eligible(cfs_rq, se)) {
update_load_avg(cfs_rq, se, 0);
+ update_entity_lag(cfs_rq, se);

The regression persists even with NO_DELAY_DEQUEUE, likely because
update_entity_lag() is now called unconditionally in dequeue_entity()
thereby adding avg_vruntime() overhead and cacheline contention for every
dequeue.

Do consider guarding the update_entity_lag() call in dequeue_entity()
with sched_feat(DELAY_DEQUEUE) check to avoid this tax when the feature
is disabled.

set_delayed(se);
return false;
}
@@ -5634,7 +5647,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
update_cfs_group(se);

if (flags & DEQUEUE_DELAYED)
- finish_delayed_dequeue_entity(se);
+ clear_delayed(se);

if (cfs_rq->nr_queued == 0) {
update_idle_cfs_rq_clock_pelt(cfs_rq);
@@ -7088,18 +7101,14 @@ requeue_delayed_entity(struct sched_entity *se)
WARN_ON_ONCE(!se->sched_delayed);
WARN_ON_ONCE(!se->on_rq);

- if (sched_feat(DELAY_ZERO)) {
- update_entity_lag(cfs_rq, se);
- if (se->vlag > 0) {
- cfs_rq->nr_queued--;
- if (se != cfs_rq->curr)
- __dequeue_entity(cfs_rq, se);
- se->vlag = 0;
- place_entity(cfs_rq, se, 0);
- if (se != cfs_rq->curr)
- __enqueue_entity(cfs_rq, se);
- cfs_rq->nr_queued++;
- }
+ if (update_entity_lag(cfs_rq, se)) {
+ cfs_rq->nr_queued--;
+ if (se != cfs_rq->curr)
+ __dequeue_entity(cfs_rq, se);
+ place_entity(cfs_rq, se, 0);
+ if (se != cfs_rq->curr)
+ __enqueue_entity(cfs_rq, se);
+ cfs_rq->nr_queued++;

Triggering a full dequeue/enqueue cycle for every vlag adjustment appears
to be a major bottleneck. Frequent RB-tree rebalancing here creates
significant contention.

This adjustment is not supposed to happen

Could we preserve fairness while recovering throughput by only re-queuing
when the lag sign changes or a significant eligibility threshold is
crossed?

Could you monitor how often the lag has to be adjusted in your case? As
mentioned above, this shouldn't happen often, in particular the case
where negative lag increases.

}

update_load_avg(cfs_rq, se, 0);
--
2.43.0

Regards,
Shubhang Kaushik