Re: [PATCH sched_ext/for-7.1] sched_ext: Documentation: Add missing calls to quiescent(), runnable()

From: Kuba Piecuch

Date: Wed Apr 08 2026 - 10:17:14 EST


On Wed Apr 8, 2026 at 1:49 PM UTC, Andrea Righi wrote:
> On Wed, Apr 08, 2026 at 12:40:09PM +0000, Kuba Piecuch wrote:
>> Hi Andrea,
>>
>> On Wed Apr 8, 2026 at 11:28 AM UTC, Andrea Righi wrote:
>> ...
>> >
>> > Looks good, but I noticed another issue, should we also change the condition up
>> > above as following?
>> >
>> > Documentation/scheduler/sched-ext.rst | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
>> > index 29d36e248f58b..99df4cc982375 100644
>> > --- a/Documentation/scheduler/sched-ext.rst
>> > +++ b/Documentation/scheduler/sched-ext.rst
>> > @@ -423,7 +423,7 @@ by a sched_ext scheduler:
>> > ops.runnable(); /* Task becomes ready to run */
>> >
>> > while (task_is_runnable(task)) {
>> > - if (task is not in a DSQ && task->scx.slice == 0) {
>> > + if (task is not in a DSQ || task->scx.slice == 0) {
>> > ops.enqueue(); /* Task can be added to a DSQ */
>> >
>> > /* Task property change (i.e., affinity, nice, etc.)? */
>> >
>> > Because we trigger ops.enqueue() when the task expired its time slice or it
>> > becomes runnable and has not been added to a DSQ.
>> >
>> > This also represents correctly the sched_change() scenario: a task being
>> > re-enqueued after sched_change() still has its time slice > 0, but we need to
>> > call ops.enqueue() for it.
>>
>> I agree that the condition should be changed, but I'm not sure that this is
>> what it should look like.
>>
>> Is the "task is not in a DSQ" part of the condition there to handle direct
>> dispatch? Apart from direct dispatch from ops.select_cpu(), I wasn't able to
>> come up with a situation where we would reach this condition with the task
>> present on some DSQ.
>
> The intent is to represent the direct dispatch from ops.select_cpu(), since in
> that case ops.enqueue() is skipped.
>
> Honestly I think if we change the && to || in that condition, everything should
> be pretty accurate.

In the case of direct dispatch from ops.select_cpu() we don't invoke
ops.dispatch() and ops.dequeue() before ops.running(), right? The current
pseudocode calls them unconditionally.

Another inaccuracy not related to direct dispatch: property changes can occur
while a task is running, whereas the pseudocode only allows for property changes
while a task is queued.

There's also preemption by a higher sched class, which is not covered in the
loop condition (task_is_runnable(task) && task->scx.slice > 0), unless we take
task_is_runnable() to return false if there's a higher-priority sched class
with runnable tasks on the CPU, though that would be in conflict with the
actual implementation of task_is_runnable() in include/linux/sched.h.

>
>>
>> A more general comment about the pseudocode: I think it can be useful to
>> introduce someone new to the general flow of the callbacks in sched_ext,
>> but the documentation should be clear that this is a simplified view that
>> makes assumptions about the behavior of the BPF scheduler itself (flags like
>> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
>> the overall system (Can sched_ext be preempted by a higher-priority sched
>> class? Can scheduling properties of a task be changed while it's running?)
>> Without stating these assumptions clearly, we risk leaving the reader falsely
>> believing they have a complete understanding.
>
> Of course this schema is not a complete representation of the entire sched_ext
> state machine, if we put everything it'd become too big and complex. I think we
> should just cover the most common use cases here. Maybe we can clarify this in
> the description before this diagram.

Let's agree on what inaccuracies need to be fixed and I'll send a v2 with fixes
and attach an appropriate disclaimer to the pseudocode.