Re: [PATCH v2] sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running

From: Honglei Wang
Date: Fri Sep 27 2024 - 09:40:56 EST




On 2024/9/27 14:56, Chen Yu wrote:
Hello Honglei,

On 2024-09-27 at 09:54:28 +0800, Honglei Wang wrote:


On 2024/9/25 16:54, Chen Yu wrote:
Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
introduced a mechanism that a wakee with shorter slice could preempt
the current running task. It also lower the bar for the current task
to be preempted, by checking the rq->nr_running instead of cfs_rq->nr_running
when the current task has ran out of time slice. But there is a scenario
that is problematic. Say, if there is 1 cfs task and 1 rt task, before
85e511df3cec, update_deadline() will not trigger a reschedule, and after
85e511df3cec, since rq->nr_running is 2 and resched is true, a resched_curr()
would happen.

Some workloads (like the hackbench reported by lkp) do not like
over-scheduling. We can see that the preemption rate has been
increased by 2.2%:

1.654e+08 +2.2% 1.69e+08 hackbench.time.involuntary_context_switches

Restore its previous check criterion.

Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@xxxxxxxxx
Suggested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
---
v1->v2:
Check cfs_rq->nr_running instead of rq->nr_running(K Prateek Nayak)
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 225b31aaee55..53a351b18740 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1247,7 +1247,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
account_cfs_rq_runtime(cfs_rq, delta_exec);
- if (rq->nr_running == 1)
+ if (cfs_rq->nr_running == 1)
return;
Hi Yu,

I'm wondering if commit 85e511df3cec wants to give more chances to do
resched just in case there are 'short slices' tasks ready in the other cfs
hierarchy.
Does something like rq->cfs->nr_running == 1 make more sense? But
maybe it helps less than 'cfs_rq->nr_running == 1' in this hackbench case.


Thanks for taking a look.

It could be possible that Peter wanted the short tasks to preempt other quickly.
If I understand correctly, when we say preemption, we usually consider two
tasks which are in the same cfs_rq(level). For example, check_preempt_wakeup_fair()
iterates the hierarchy from down-up until the current task and the wakee are in the
same level via find_matching_se(&se, &pse), then check if the wakee can preempt the
current. This should be consistent with the tick preemption in update_curr(). And
whether the short task should preempt the current is checked by
update_curr() -> did_preempt_short(), rather than checking the cfs_rq->nr_running/nr_h_running
I suppose.

Hi Yu,

Yep, I understand the preemption should happen in the same cfs level. Just not sure the purpose of the 'nr_running check' stuff. Perhaps its role is just to judge whether it’s necessary to do the preemption check. If there is at least one more ready (cfs) task in the rq and current is not eligible, we take care of the waiting tasks. Thoughts?

Thanks,
Honglei

Thanks,
Chenyu