Re: [PATCH sched_ext/for-6.12-fixes] Disable SM_IDLE/rq empty path when scx_enabled

From: K Prateek Nayak
Date: Mon Sep 23 2024 - 23:41:09 EST


Hello Tejun,

Just seeking some clarification here; the reasoning to bypass the
SM_IDLE fast-path looks sound otherwise.

On 9/23/2024 9:13 PM, Tejun Heo wrote:
Applied to sched_ext/for-6.12-fixes with minor edits:
------ 8< ------
From edf1c586e92675c4e0eb27758fcdb55a56838de1 Mon Sep 17 00:00:00 2001
From: Pat Somaru <patso@xxxxxxxxxxxxxx>
Date: Fri, 20 Sep 2024 15:41:59 -0400
Subject: [PATCH] sched, sched_ext: Disable SM_IDLE/rq empty path when
scx_enabled()

Disable the rq empty path when scx is enabled. SCX must consult the BPF
scheduler (via the dispatch path in balance) to determine if rq is empty.

This fixes stalls when scx is enabled.

Signed-off-by: Pat Somaru <patso@xxxxxxxxxxxxxx>
Fixes: 3dcac251b066 ("sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()")
Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
---
kernel/sched/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b6cc1cf499d6..43e453ab7e20 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6591,7 +6591,8 @@ static void __sched notrace __schedule(int sched_mode)
*/
prev_state = READ_ONCE(prev->__state);
if (sched_mode == SM_IDLE) {
- if (!rq->nr_running) {
+ /* SCX must consult the BPF scheduler to tell if rq is empty */

I was wondering if the sched_ext case could simply do:

	if (scx_enabled())
		prev_balance(rq, prev, rf);

and use "rq->scx.flags" to skip balancing in balance_scx() later when
__pick_next_task() calls prev_balance(). However (and please correct me
if I'm wrong here), balance_scx() calls balance_one(), which can call
consume_dispatch_q() to pick a task from a global / user-defined
dispatch queue, and in doing so it does not update "rq->nr_running".

I could only see add_nr_running() being called from enqueue_task_scx(),
and this happens even before the ext core calls do_enqueue_task(), which
hooks into the BPF layer that decides where the task actually goes.

Is my understanding correct that whichever CPU is the initial target of
the enqueue_task_scx() callback is the one that accounts the enqueue in
"rq->nr_running" until the task is dequeued, or did I miss something?
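
To make the question concrete, here is a toy userspace model of the
accounting as I understand it. Everything in it (struct toy_rq, struct
toy_dsq, the toy_*() helpers) is hypothetical and only mirrors the
description above; it is not kernel code, and whether the real paths
behave this way is exactly what I am asking. The model also assumes the
BPF scheduler always parks the task on a shared queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these are NOT the kernel's struct rq or dispatch queues. */
struct toy_rq {
	unsigned int nr_running;   /* bumped at enqueue time on the target CPU */
	unsigned int local_queued; /* tasks sitting on this CPU's local queue */
};

struct toy_dsq {
	unsigned int queued;       /* tasks parked on a shared dispatch queue */
};

/*
 * Enqueue: account the task on the target rq first, then let the
 * "BPF scheduler" place it. Assumption: it always goes to the shared queue.
 */
static void toy_enqueue(struct toy_rq *target, struct toy_dsq *shared)
{
	target->nr_running++;
	shared->queued++;
}

/*
 * Balance: pull one task from the shared queue into the local queue.
 * nr_running is deliberately left untouched, as described above.
 * Returns true if this CPU has something to run afterwards.
 */
static bool toy_balance(struct toy_rq *rq, struct toy_dsq *shared)
{
	if (rq->local_queued)
		return true;
	if (!shared->queued)
		return false;
	shared->queued--;
	rq->local_queued++;
	return true;
}

/* The idle fast-path condition, with the scx_enabled() check from the patch. */
static bool toy_idle_fast_path(const struct toy_rq *rq, bool scx_enabled)
{
	return !rq->nr_running && !scx_enabled;
}
```

In this model, a task enqueued with CPU0 as the target leaves CPU1 with
nr_running == 0 even though toy_balance() on CPU1 can still find work,
so a check on nr_running alone would send CPU1 back to idle.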

+ if (!rq->nr_running && !scx_enabled()) {
next = prev;
goto picked;
}

--
Thanks and Regards,
Prateek