Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

From: Christian Loehle

Date: Mon Feb 02 2026 - 05:20:22 EST


On 2/2/26 07:45, Andrea Righi wrote:
> Hi Christian,
>
> On Sun, Feb 01, 2026 at 10:47:22PM +0000, Christian Loehle wrote:
>> On 2/1/26 09:08, Andrea Righi wrote:
>>> Currently, ops.dequeue() is only invoked when the sched_ext core knows
>>> that a task resides in BPF-managed data structures, which causes it to
>>> miss scheduling property change events. In addition, ops.dequeue()
>>> callbacks are completely skipped when tasks are dispatched to non-local
>>> DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
>>> track task state.
>>>
>>> Fix this by guaranteeing that each task entering the BPF scheduler's
>>> custody triggers exactly one ops.dequeue() call when it leaves that
>>> custody, whether the exit is due to a dispatch (regular or via a core
>>> scheduling pick) or to a scheduling property change (e.g.
>>> sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
>>> balancing, etc.).
>>>
>>> BPF scheduler custody concept: a task is considered to be in "BPF
>>> scheduler's custody" when it has been queued in BPF-managed data
>>> structures and the BPF scheduler is responsible for its lifecycle.
>>> Custody ends when the task is dispatched to a local DSQ, selected by
>>> core scheduling, or removed due to a property change.
>>>
>>> Tasks directly dispatched to local DSQs (via %SCX_DSQ_LOCAL or
>>> %SCX_DSQ_LOCAL_ON) bypass the BPF scheduler entirely and are not in its
>>> custody. As a result, ops.dequeue() is not invoked for these tasks.
>>>
>>> To identify dequeues triggered by scheduling property changes, introduce
>>> the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
>>> the dequeue was caused by a scheduling property change.
>>>
>>> New ops.dequeue() semantics:
>>> - ops.dequeue() is invoked exactly once when the task leaves the BPF
>>>   scheduler's custody, in one of the following cases:
>>>   a) regular dispatch: task was dispatched to a non-local DSQ (global
>>>      or user DSQ), ops.dequeue() called without any special flags set
>>>   b) core scheduling dispatch: core-sched picks task before dispatch,
>>>      dequeue called with %SCX_DEQ_CORE_SCHED_EXEC flag set
>>>   c) property change: task properties modified before dispatch,
>>>      dequeue called with %SCX_DEQ_SCHED_CHANGE flag set
>>>
>>> This allows BPF schedulers to:
>>> - reliably track task ownership and lifecycle,
>>> - maintain accurate accounting of managed tasks,
>>> - update internal state when tasks change properties.
>>>
>>
>> So I have finally gotten around to updating scx_storm to the new semantics,
>> see:
>> https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
>>
>> I don't think the new ops.dequeue() semantics are enough to make inserts to
>> local-on from anywhere safe, because they're still racing with a dequeue from
>> another CPU?
>
> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
> ops.dispatch().
>
> When task properties change between scx_bpf_dsq_insert() and the actual
> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
> scx_error().
>
> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notification happens after the
> property change, so it can't prevent already-queued dispatches from
> failing. The race window is between ops.dispatch() returning and
> dispatch_to_local_dsq() executing.
>
> We can address this in a separate patch set. One thing at a time. :)
>
>>
>> Furthermore, I can reproduce the following quite easily with this patch
>> applied, using something like:
>>
>> hackbench -l 1000 & timeout 10 ./build/scheds/c/scx_storm
>>
>> [ 44.356878] sched_ext: BPF scheduler "simple" enabled
>> [ 59.315370] sched_ext: BPF scheduler "simple" disabled (unregistered from user space)
>> [ 85.366747] sched_ext: BPF scheduler "storm" enabled
>> [ 85.371324] ------------[ cut here ]------------
>> [ 85.373370] WARNING: kernel/sched/sched.h:1571 at update_locked_rq+0x64/0x6c, CPU#5: gmain/1111
>
> Ah yes! I think I see it; can you try this on top?
>
> Thanks,
> -Andrea
>
> kernel/sched/ext.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6d6f1253039d8..d8fed4a49195d 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2248,7 +2248,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
> p->scx.flags |= SCX_TASK_OPS_ENQUEUED;
> } else {
> if (p->scx.flags & SCX_TASK_OPS_ENQUEUED)
> - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, task_rq(p), p, 0);
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
>
> p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> }

Yup, that fixes it!