Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

From: Kuba Piecuch

Date: Mon Feb 02 2026 - 06:58:10 EST

Hi Andrea,

Looks good overall, but we need to settle on the global DSQ semantics and
clear up a few edge cases.

On Sun Feb 1, 2026 at 9:08 AM UTC, Andrea Righi wrote:
> diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
> index 404fe6126a769..6d9e82e6ca9d4 100644
> --- a/Documentation/scheduler/sched-ext.rst
> +++ b/Documentation/scheduler/sched-ext.rst
> @@ -252,6 +252,80 @@ The following briefly shows how a waking task is scheduled and executed.
>
> * Queue the task on the BPF side.
>
> + **Task State Tracking and ops.dequeue() Semantics**
> +
> + Once ``ops.select_cpu()`` or ``ops.enqueue()`` is called, the task may
> + enter the "BPF scheduler's custody" depending on where it's dispatched:
> +
> + * **Direct dispatch to local DSQs** (``SCX_DSQ_LOCAL`` or
> + ``SCX_DSQ_LOCAL_ON | cpu``): The task bypasses the BPF scheduler
> + entirely and goes straight to the CPU's local run queue. The task
> + never enters BPF custody, and ``ops.dequeue()`` will not be called.
> +
> + * **Dispatch to non-local DSQs** (``SCX_DSQ_GLOBAL`` or custom DSQs):
> + the task enters the BPF scheduler's custody. When the task later
> + leaves BPF custody (dispatched to a local DSQ, picked by core-sched,
> + or dequeued for sleep/property changes), ``ops.dequeue()`` will be
> + called exactly once.
> +
> + * **Queued on BPF side**: The task is in BPF data structures and in BPF
> + custody, ``ops.dequeue()`` will be called when it leaves.
> +
> + The key principle: **ops.dequeue() is called when a task leaves the BPF
> + scheduler's custody**. A task is in BPF custody if it's on a non-local
> + DSQ or in BPF data structures. Once dispatched to a local DSQ or after
> + ops.dequeue() is called, the task is out of BPF custody and the BPF
> + scheduler no longer needs to track it.
> +
> + This works correctly with the ``ops.select_cpu()`` direct dispatch
> + optimization: even though it skips ``ops.enqueue()`` invocation, if the
> + task is dispatched to a non-local DSQ, it enters BPF custody and will
> + get ``ops.dequeue()`` when it leaves. This provides the performance
> + benefit of avoiding the ``ops.enqueue()`` roundtrip while maintaining
> + correct state tracking.
> +
> + The dequeue can happen for different reasons, distinguished by flags:
> +
> + 1. **Regular dispatch workflow**: when the task is dispatched from a
> + non-local DSQ to a local DSQ (leaving BPF custody for execution),
> + ``ops.dequeue()`` is triggered without any special flags.

Maybe add a note that, for a task on a global DSQ, this can happen
asynchronously, i.e. without the BPF scheduler explicitly dispatching the
task to a local DSQ? Alternatively, make that case a separate dequeue reason
with its own flag, e.g. SCX_DEQ_PICKED_FROM_GLOBAL_DSQ?
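
To sketch what the separate-flag option could look like from the BPF
scheduler's point of view, here is a small userspace model. Everything in it
is an assumption for illustration: the bit values are placeholders rather than
the kernel's definitions, SCX_DEQ_PICKED_FROM_GLOBAL_DSQ does not exist today,
and enum deq_reason and classify_dequeue() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values for illustration only; NOT the kernel's definitions. */
#define DEQUEUE_SLEEP                   (1ULL << 0)
#define SCX_DEQ_CORE_SCHED_EXEC         (1ULL << 1)
#define SCX_DEQ_SCHED_CHANGE            (1ULL << 2)
/* Hypothetical flag proposed in this review; it does not exist in the kernel. */
#define SCX_DEQ_PICKED_FROM_GLOBAL_DSQ  (1ULL << 3)

static_assert((SCX_DEQ_PICKED_FROM_GLOBAL_DSQ &
	       (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC | SCX_DEQ_SCHED_CHANGE)) == 0,
	      "the hypothetical flag must not overlap the other placeholder bits");

/* Hypothetical classification a BPF scheduler might do in ops.dequeue(). */
enum deq_reason {
	DEQ_REASON_DISPATCH,		/* regular dispatch, no special flags */
	DEQ_REASON_GLOBAL_DSQ_PICK,	/* picked asynchronously from a global DSQ */
	DEQ_REASON_SLEEP,
	DEQ_REASON_CORE_SCHED,
	DEQ_REASON_SCHED_CHANGE,
};

static enum deq_reason classify_dequeue(uint64_t deq_flags)
{
	if (deq_flags & DEQUEUE_SLEEP)
		return DEQ_REASON_SLEEP;
	if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
		return DEQ_REASON_CORE_SCHED;
	if (deq_flags & SCX_DEQ_PICKED_FROM_GLOBAL_DSQ)
		return DEQ_REASON_GLOBAL_DSQ_PICK;
	if (deq_flags & SCX_DEQ_SCHED_CHANGE)
		return DEQ_REASON_SCHED_CHANGE;
	return DEQ_REASON_DISPATCH;
}
```

With a dedicated flag, a scheduler that never dispatched the task itself could
still tell why it lost custody, instead of seeing a flagless dequeue it did not
initiate.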

> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index bcb962d5ee7d8..0d003d2845393 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -84,6 +84,7 @@ struct scx_dispatch_q {
> /* scx_entity.flags */
> enum scx_ent_flags {
> SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
> + SCX_TASK_OPS_ENQUEUED = 1 << 1, /* under ext scheduler's custody */

Nit: I think "in BPF scheduler's custody" would be a bit clearer, as
"ext scheduler" could potentially be interpreted to mean SCHED_CLASS_EXT
as a whole.

> @@ -1523,6 +1603,30 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
>
> switch (opss & SCX_OPSS_STATE_MASK) {
> case SCX_OPSS_NONE:
> + /*
> + * Task is not in BPF data structures (either dispatched to
> + * a DSQ or running). Only call ops.dequeue() if the task
> + * is still in BPF scheduler's custody
> + * (%SCX_TASK_OPS_ENQUEUED is set).
> + *
> + * If the task has already been dispatched to a local DSQ
> + * (left BPF custody), the flag will be clear and we skip
> + * ops.dequeue()
> + *
> + * If this is a property change (not sleep/core-sched) and
> + * the task is still in BPF custody, set the
> + * %SCX_DEQ_SCHED_CHANGE flag.
> + */
> + if (SCX_HAS_OP(sch, dequeue) &&
> + p->scx.flags & SCX_TASK_OPS_ENQUEUED) {
> + u64 flags = deq_flags;
> +
> + if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
> + flags |= SCX_DEQ_SCHED_CHANGE;

I think this logic will result in ops.dequeue(SCHED_CHANGE) being called for
tasks that are picked from a global DSQ and migrated from a remote rq to the
local rq. That is technically correct, since the task is migrating rqs, but it
may be confusing because it fits two cases in the documentation:

* Since the task is leaving BPF custody for execution, ops.dequeue() should be
called without any special flags.
* Since the task is being migrated between rqs, ops.dequeue() should be called
with SCX_DEQ_SCHED_CHANGE.
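
To make the overlap concrete, here is a minimal userspace model of the check
in the hunk above. The flag bit values are placeholders picked for
illustration, not the kernel's definitions, and model_dequeue_flags() is a
hypothetical helper. It also assumes that the migration path passes deq_flags
with neither DEQUEUE_SLEEP nor SCX_DEQ_CORE_SCHED_EXEC set:

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values for illustration only; NOT the kernel's definitions. */
#define DEQUEUE_SLEEP            (1ULL << 0)
#define SCX_DEQ_CORE_SCHED_EXEC  (1ULL << 1)
#define SCX_DEQ_SCHED_CHANGE     (1ULL << 2)

static_assert((SCX_DEQ_SCHED_CHANGE &
	       (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)) == 0,
	      "placeholder bits must not overlap");

/*
 * Mirrors the condition in the quoted hunk: any dequeue that is neither
 * a sleep nor a core-sched pick is tagged as a property change.
 */
static uint64_t model_dequeue_flags(uint64_t deq_flags)
{
	uint64_t flags = deq_flags;

	if (!(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
		flags |= SCX_DEQ_SCHED_CHANGE;

	return flags;
}
```

In this model, model_dequeue_flags(0) returns SCX_DEQ_SCHED_CHANGE, so if the
global DSQ pick arrives with no special flags, the BPF scheduler cannot tell it
apart from a real property change.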

> +
> + SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, flags);
> + p->scx.flags &= ~SCX_TASK_OPS_ENQUEUED;
> + }
> break;
> case SCX_OPSS_QUEUEING:
> /*

Thanks,
Kuba