Re: [PATCH] sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
From: K Prateek Nayak
Date: Tue Sep 24 2024 - 06:28:12 EST
Hello Chenyu,
On 9/23/2024 12:51 PM, Chen Yu wrote:
Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
introduced a mechanism by which a wakee with a shorter slice can preempt
the currently running task. It also lowered the bar for the current task
to be preempted, by checking rq->nr_running instead of cfs_rq->nr_running
when the current task has run out of its time slice. Say there is 1 cfs
task and 1 rt task: before 85e511df3cec, update_deadline() would
not trigger a reschedule; after 85e511df3cec, since rq->nr_running
is 2 and resched is true, a resched_curr() happens.
Some workloads (like the hackbench run reported by lkp) do not like
over-scheduling. We can see that the preemption rate has
increased by 2.2%:
1.654e+08 +2.2% 1.69e+08 hackbench.time.involuntary_context_switches
Restore its previous check criterion.
Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@xxxxxxxxx
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
Gave it a spin on my dual socket 3rd Generation EPYC System and I do not
see as big a jump in hackbench numbers as Oliver reported, most likely
because I couldn't emulate the exact scenario where a fair task is
running while an RT task is queued. Following are the numbers from my
testing:
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case:         tip[pct imp](CV)        preempt-fix[pct imp](CV)
1-groups      1.00 [ -0.00]( 2.60)    1.00 [ 0.17]( 2.12)
2-groups      1.00 [ -0.00]( 1.21)    0.98 [ 2.05]( 0.95)
4-groups      1.00 [ -0.00]( 1.63)    0.97 [ 2.65]( 1.53)
8-groups      1.00 [ -0.00]( 1.34)    0.99 [ 0.81]( 1.33)
16-groups     1.00 [ -0.00]( 2.07)    0.98 [ 2.31]( 1.09)
--
Feel free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 225b31aaee55..2859fc7e2da2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
/*
* The task has consumed its request, reschedule.
*/
- return true;
+ return (cfs_rq->nr_running > 1);
Was there a strong reason why Peter decided to use "rq->nr_running"
instead of "cfs_rq->nr_running" with PREEMPT_SHORT in update_curr()?
I wonder if it was to force a pick_next_task() cycle to dequeue a
possibly delayed entity but AFAICT, "cfs_rq->nr_running" should
account for the delayed entity still on the cfs_rq and perhaps the
early return in update_curr() can just be changed to use
"cfs_rq->nr_running". Not sure if I'm missing something trivial.
}
#include "pelt.h"
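
To make the 1 cfs + 1 rt case from the commit message concrete, here is a
toy userspace model of the two check criteria. This is only a sketch: the
toy_rq / toy_cfs_rq structs and both helper functions are hypothetical
stand-ins invented for illustration, not the actual code in
kernel/sched/fair.c, and they model nothing beyond the task counts.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for struct cfs_rq / struct rq. */
struct toy_cfs_rq {
	unsigned int nr_running;	/* fair tasks only */
};

struct toy_rq {
	unsigned int nr_running;	/* tasks of all classes (fair + rt + ...) */
	struct toy_cfs_rq cfs;
};

/*
 * Criterion as described for after 85e511df3cec: once the slice is
 * consumed, reschedule whenever the rq as a whole holds more than one task.
 */
static bool resched_by_rq_count(const struct toy_rq *rq, bool slice_expired)
{
	return slice_expired && rq->nr_running > 1;
}

/*
 * Criterion restored by the patch: reschedule only if another fair task
 * is queued on the cfs_rq.
 */
static bool resched_by_cfs_rq_count(const struct toy_rq *rq, bool slice_expired)
{
	return slice_expired && rq->cfs.nr_running > 1;
}
```

With rq.nr_running = 2 and rq.cfs.nr_running = 1 (one fair task plus one rt
task) and an expired slice, the first helper asks for a reschedule while
the second does not, which is the extra preemption the commit message
describes.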
--
Thanks and Regards,
Prateek