[PATCH V2] sched: Forward deadline for early tick

From: zihan zhou
Date: Wed Dec 25 2024 - 00:41:40 EST


Due to the tick error, the eevdf scheduler exhibits unexpected behavior.
For example, a machine with sysctl_sched_base_slice=3ms, CONFIG_HZ=1000
should trigger a tick every 1ms. A se (sched_entity) with default weight
1024 should theoretically reach its deadline on the third tick.

However, the tick often arrives a little faster than expected. In this
case, the se can only wait until the next tick to consider that it has
reached its deadline, and will run 1ms longer.

vruntime + sysctl_sched_base_slice = deadline
|-----------|-----------|-----------|-----------|
1ms 1ms 1ms 1ms
^ ^ ^ ^
tick1 tick2 tick3 tick4(nearly 4ms)

Here is a simple example of this scenario,
where sysctl_sched_base_slice=3ms, CONFIG_HZ=1000,
the CPU is Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60GHz,
and "while :; do :; done &" is run twice with pids 72112 and 72113.
According to the design of eevdf,
both should run for 3ms each, but often they run for 4ms.

time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
------------- ------ ------------- --------- --------- ---------
56696.846136 [0001] perf[72368] 0.000 0.000 0.000
56696.849378 [0001] bash[72112] 0.000 0.000 3.241
56696.852379 [0001] bash[72113] 0.000 0.000 3.000
56696.852964 [0001] sleep[72369] 0.000 6.261 0.584
56696.856378 [0001] bash[72112] 3.585 0.000 3.414
56696.860379 [0001] bash[72113] 3.999 0.000 4.000 <
56696.864379 [0001] bash[72112] 4.000 0.000 4.000 <
56696.868377 [0001] bash[72113] 4.000 0.000 3.997 <
56696.871378 [0001] bash[72112] 3.997 0.000 3.000
56696.874377 [0001] bash[72113] 3.000 0.000 2.999
56696.877377 [0001] bash[72112] 2.999 0.000 2.999
56696.881377 [0001] bash[72113] 2.999 0.000 3.999 <

There are two reasons for tick error: clockevent precision and the
CONFIG_IRQ_TIME_ACCOUNTING. with CONFIG_IRQ_TIME_ACCOUNTING every tick
will less than 1ms, but even without it, because of clockevent precision,
tick still often less than 1ms. In the system above, there is no such
config, but the task still often takes more than 3ms.

To solve this problem, we add a sched feature FORWARD_DEADLINE,
consider forwarding the deadline appropriately. When vruntime is very
close to the deadline, and the task is ineligible, we consider that task
should be resched, the tolerance is set to min(vslice/128, tick/2).

when open FORWARD_DEADLINE,
the task will run once every 3ms as designed by eevdf:

time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
----------- ------ ---------------- --------- --------- ---------
207.207293 [0001] bash[1699] 3.998 0.000 3.000
207.210294 [0001] bash[1694] 3.000 0.000 3.000
207.213300 [0001] bash[1699] 3.000 0.000 3.006
207.216293 [0001] bash[1694] 3.006 0.000 2.993
207.219293 [0001] bash[1699] 2.993 0.000 3.000
207.222293 [0001] bash[1694] 3.000 0.000 2.999
207.225293 [0001] bash[1699] 2.999 0.000 3.000
207.228293 [0001] bash[1694] 3.000 0.000 3.000
207.231293 [0001] bash[1699] 3.000 0.000 3.000
207.234293 [0001] bash[1694] 3.000 0.000 2.999
207.237292 [0001] bash[1699] 2.999 0.000 2.999
207.240293 [0001] bash[1694] 2.999 0.000 3.000
207.243293 [0001] bash[1699] 3.000 0.000 3.000


Signed-off-by: zihan zhou <15645113830zzh@xxxxxxxxx>
Signed-off-by: yaozhenguo <yaozhenguo@xxxxxx>
Signed-off-by: yaowenchao <yaowenchao@xxxxxx>
---
v2: 1. just forward deadline for ineligible task.
2. for ineligible task, vd_i = vd_i + r_i / w_i, but for eligible task,
vd_i = ve_i + r_i / w_i, which is the same as before.
3. tolerance = min(vslice>>7, TICK_NSEC/2), prevent scheduling errors
from increasing when vslice is too large relative to tick.
---
kernel/sched/fair.c | 42 +++++++++++++++++++++++++++++++++++++----
kernel/sched/features.h | 7 +++++++
2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..9cc52f632bb1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if ((s64)(se->vruntime - se->deadline) < 0)
- return false;
+
+ u64 vslice;
+ u64 tolerance = 0;
+ u64 next_deadline;

/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1016,11 +1018,43 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (!se->custom_slice)
se->slice = sysctl_sched_base_slice;
+ vslice = calc_delta_fair(se->slice, se);
+
+ /*
+ * vd_i = ve_i + r_i / w_i
+ */
+ next_deadline = se->vruntime + vslice;
+
+ if (sched_feat(FORWARD_DEADLINE))
+ tolerance = min(vslice>>7, TICK_NSEC/2);
+
+ if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
+ return false;

/*
- * EEVDF: vd_i = ve_i + r_i / w_i
+ * when se->vruntime + tolerance - se->deadline >= 0
+ * but se->vruntime - se->deadline < 0,
+ * there is two case: if entity is eligible?
+ * if entity is not eligible, we don't need wait deadline, because
+ * eevdf don't guarantee
+ * an ineligible entity can exec its request time in one go.
+ * but when entity eligible, just let it run, which is the
+ * same processing logic as before.
*/
- se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+
+ if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
+
+ if (entity_eligible(cfs_rq, se))
+ return false;
+
+ /*
+ * vd_i = vd_i + r_i / w_i
+ */
+ next_deadline = se->deadline + vslice;
+ }
+
+
+ se->deadline = next_deadline;

/*
* The task has consumed its request, reschedule.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..5c74deec7209 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
*/
SCHED_FEAT(PREEMPT_SHORT, true)

+/*
+ * For some cases where the tick is faster than expected,
+ * move the deadline forward
+ */
+SCHED_FEAT(FORWARD_DEADLINE, true)
+
+
/*
* Prefer to schedule the task we woke last (assuming it failed
* wakeup-preemption), since its likely going to consume data we
--
2.39.3 (Apple Git-146)