Re: [PATCH sched_ext/for-7.1] sched_ext: Documentation: Add missing calls to quiescent(), runnable()
From: Christian Loehle
Date: Thu Apr 09 2026 - 05:46:42 EST
On 4/9/26 09:46, Kuba Piecuch wrote:
> On Wed Apr 8, 2026 at 2:54 PM UTC, Andrea Righi wrote:
> ...
>>>
>>> Another inaccuracy not related to direct dispatch: property changes can occur
>>> while a task is running, while the psedocode only allows for property changes
>>> while a task is queued.
>>
>> Sure... but again, modelling all the possible scenarios would make the
>> pseudocode completely unreadable.
>
> I'm not arguing we should cover all scenarios.
>
> I'm ok with omitting scenarios whose existence depends on a configuration flag
> or presence/absence of a callback, because:
>
> a) Using the right configuration, one can actually write a scheduler where the
> pseudocode is an accurate representation of the task lifecycle;
>
> b) The assumptions about the configuration can be clearly stated next to the
> pseudocode.
>
> I'm less ok with omitting specific scenarios that can't be simply "turned off"
> because they are triggered by the scheduled tasks themselves. A task's property
> being changed while it's running is one example of such a scenario -- one can't
> just prevent it from happening by setting a configuration flag, and sched_ext
> schedulers implementing dequeue/quiescent/runnable/enqueue should be aware of
> it.
>
> What I especially don't like is giving the reader a partial picture that looks
> like a complete one, as is the case with property changes here. We're letting
> the reader know that it can happen, but the pseudocode makes it look like it
> can only happen while a task is queued and not while it's running, giving the
> reader a false impression that they can assume property changes apply only to
> queued tasks.
Agreed FWIW, I've implemented a few schedulers that need to track state transitions
100% accurately and it was painful to get it 100% right.
I think it's either this or we add a sample BPF scheduler that actually does
track/validate all possible transitions per-task accurately to illustrate. (Maybe a
selftest?)
But that would mean the below becoming quite a bit more complex, too.
>
>>
>> IMHO it'd be better to give an overview of the most common use cases here and
>> clarify in the description that the diagram doesn't cover all the possible
>> scenarios. This one is a special use case that, personally, I wouldn't cover in
>> the pseudocode.
>>
>>>
>>> There's also preemption by a higher sched class, which is not covered in the
>>> loop condition (task_is_runnable(task) && task->scx.slice > 0), unless we take
>>> task_is_runnable() to return false if there's a higher-priority sched class
>>> with runnable tasks on the CPU, though that would be in conflict with the
>>> actual implementation of task_is_runnable() in include/linux/sched.h.
>>
>> Ditto.
>>
>>>
>>>>
>>>>>
>>>>> A more general comment about the pseudocode: I think it can be useful to
>>>>> introduce someone new to the general flow of the callbacks in sched_ext,
>>>>> but the documentation should be clear that this is a simplified view that
>>>>> makes assumptions about the behavior of the BPF scheduler itself (flags like
>>>>> SCX_OPS_ENQ_LAST, whether the scheduler uses direct dispatch), as well as
>>>>> the overall system (Can sched_ext be preempted by a higher-priority sched
>>>>> class? Can scheduling properties of a task be changed while it's running?)
>>>>> Without stating these assumptions clearly, we risk leaving the reader falsely
>>>>> believing they have a complete understanding.
>>>>
>>>> Of course this schema is not a complete representation of the entire sched_ext
>>>> state machine, if we put everything it'd become too big and complex. I think we
>>>> should just cover the most common use cases here. Maybe we can clarify this in
>>>> the description before this diagram.
>>>
>>> Let's agree on what inaccuracies need to be fixed and I'll send a v2 with fixes
>>> and attach an appropriate disclaimer to the pseudocode.
>>
>> If we move ops.dispatch() + ops.dequeue() inside the ops.enqueue() block I think
>> the pseudocode becomes "fairly" accurate. At least more accurate than what we
>> have right now. It won't be perfect, but it can help newer sched_ext devs having
>> an overview the task lifecycle without going too much into implementation
>> details.
>>
>> So, to recap, what do you think about this?
>>
>> ops.init_task(); /* A new task is created */
>> ops.enable(); /* Enable BPF scheduling for the task */
>>
>> while (task in SCHED_EXT) {
>> if (task can migrate)
>> ops.select_cpu(); /* Called on wakeup (optimization) */
>>
>> ops.runnable(); /* Task becomes ready to run */
>>
>> while (task_is_runnable(task)) {
>> if (task is not in a DSQ || task->scx.slice == 0) {
>> ops.enqueue(); /* Task can be added to a DSQ */
>>
>> /* Task property change (i.e., affinity, nice, etc.)? */
>> if (sched_change(task)) {
>> ops.dequeue(); /* Exiting BPF scheduler custody */
>> ops.quiescent();
>>
>> /* Property change callback, e.g. ops.set_weight() */
>>
>> ops.runnable();
>> continue;
>> }
>>
>> /* Any usable CPU becomes available */
>>
>> ops.dispatch(); /* Task is moved to a local DSQ */
>> ops.dequeue(); /* Exiting BPF scheduler custody */
Is this true here? Any dispatch followed by a dequeue?
>> }
>>
>> ops.running(); /* Task starts running on its assigned CPU */
>>
>> while (task_is_runnable(task) && task->scx.slice > 0) {
>> ops.tick(); /* Called every 1/HZ seconds */
>>
>> if (task->scx.slice == 0)
>> ops.dispatch(); /* task->scx.slice can be refilled */
>> }
>>
>> ops.stopping(); /* Task stops running (time slice expires or wait) */
>> }
>>
>> ops.quiescent(); /* Task releases its assigned CPU (wait) */
>> }
>>
>> ops.disable(); /* Disable BPF scheduling for the task */
>> ops.exit_task(); /* Task is destroyed */
>
> I don't love it (and I probably never will), but I agree it's the best so far.
> I'll send a v2 with the updated pseudocode and I'll put a bit of a disclaimer
> before it.