Re: [PATCH 3/4] sched_ext: Fix ops.dequeue() semantics

From: Andrea Righi

Date: Tue Feb 17 2026 - 06:51:53 EST


On Mon, Feb 16, 2026 at 08:30:44PM -1000, Tejun Heo wrote:
> Hello,
>
> On Sun, Feb 15, 2026 at 08:16:55PM +0100, Andrea Righi wrote:
> > +/*
> > + * Return true if @p is moving due to an internal SCX migration, false
> > + * otherwise.
> > + */
> > +static inline bool task_scx_migrating(struct task_struct *p)
> > +{
> > + return task_on_rq_migrating(p) && p->scx.sticky_cpu >= 0;
> > +}
>
> Can you explain why testing task_on_rq_migrating() is necessary? What does
> just testing p->scx.sticky_cpu miss?

Yeah, we only need to check sticky_cpu here: whenever sticky_cpu >= 0 we
are by definition in the middle of an SCX-initiated migration. I'll change
that and add a comment.

>
> > @@ -1106,6 +1153,12 @@ static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> > dsq_mod_nr(dsq, 1);
> > p->scx.dsq = dsq;
> >
> > + /*
> > + * Non-terminal DSQs: task enters BPF scheduler's custody.
> > + */
> > + if (!is_terminal_dsq(dsq))
> > + p->scx.flags |= SCX_TASK_IN_CUSTODY;
>
> Can't this be done in the local_dsq_post_enq() else block?

Oh yes, this is better.

>
> > /*
> > * scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
> > * direct dispatch path, but we clear them here because the direct
> > @@ -1122,10 +1175,23 @@ static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
> > if (enq_flags & SCX_ENQ_CLEAR_OPSS)
> > atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
> >
> > - if (is_local)
> > + if (is_local) {
> > local_dsq_post_enq(dsq, p, enq_flags);
> > - else
> > + } else {
> > + if (dsq->id == SCX_DSQ_GLOBAL || dsq->id == SCX_DSQ_BYPASS) {
> > + /*
> > + * Task is on the global or bypass DSQ: call
> > + * ops.dequeue() if the task was in BPF custody and
> > + * it's not an internal SCX migration.
> > + */
> > + if ((p->scx.flags & SCX_TASK_IN_CUSTODY) &&
> > + !task_scx_migrating(p)) {
> > + call_task_dequeue(sch, rq, p, 0);
> > + p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> > + }
> > + }
>
> If you add else {} here, that'd catch the non-terminal DSQs, right? If that
> works, I think that'd be more logical organization.

Agreed.

>
> > @@ -1531,6 +1611,25 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> >
> > switch (opss & SCX_OPSS_STATE_MASK) {
> > case SCX_OPSS_NONE:
> > + /*
> > + * If the task is still in BPF scheduler's custody
> > + * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
> > + *
> > + * The code that clears ops_state to %SCX_OPSS_NONE does
> > + * not always clear %SCX_TASK_IN_CUSTODY: in
> > + * dispatch_to_local_dsq(), when we're moving a task that
> > + * was in %SCX_OPSS_DISPATCHING to a remote CPU's local
> > + * DSQ, we only set ops_state to %SCX_OPSS_NONE so that a
> > + * concurrent dequeue can proceed, but we clear
> > + * %SCX_TASK_IN_CUSTODY only when we later enqueue or move
> > + * the task. So we can see NONE + IN_CUSTODY here and we
> > + * must handle it.
> > + */
> > + if ((p->scx.flags & SCX_TASK_IN_CUSTODY) &&
> > + !task_scx_migrating(p)) {
> > + call_task_dequeue(sch, rq, p, op_deq_flags);
> > + p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> > + }
>
> Except for OPSS_QUEUED path, all call_task_dequeue() callers are using the
> same code block and OPSS_QUEUED can too. Can't we move the whole block into
> call_task_dequeue()?

Ack.

>
> > break;
> > case SCX_OPSS_QUEUEING:
> > /*
> > @@ -1539,9 +1638,18 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > */
> > BUG();
> > case SCX_OPSS_QUEUED:
> > - if (SCX_HAS_OP(sch, dequeue))
> > - SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
> > - p, deq_flags);
> > + /*
> > + * Task is in BPF scheduler's custody (not dispatched yet).
> > + * Call ops.dequeue() unless this is an SCX-initiated
> > + * migration.
> > + *
> > + * A queued task must always be in the BPF scheduler's custody.
> > + */
> > + WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY));
> > + if (!task_scx_migrating(p)) {
> > + call_task_dequeue(sch, rq, p, op_deq_flags);
> > + p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> > + }
>
> This placement is a bit odd as before the following try_cmpxchg, this path
> doesn't have the ownership of the task. Please see below.
>
> > if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
> > SCX_OPSS_NONE))
> > @@ -1563,6 +1671,16 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
> > */
> > wait_ops_state(p, SCX_OPSS_DISPATCHING);
> > BUG_ON(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> > +
> > + /*
> > + * After DISPATCHING completes, task may still be
> > + * IN_CUSTODY (see the NONE case).
> > + */
> > + if ((p->scx.flags & SCX_TASK_IN_CUSTODY) &&
> > + !task_scx_migrating(p)) {
> > + call_task_dequeue(sch, rq, p, op_deq_flags);
> > + p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
> > + }
> > break;
> > }
> > }
>
> If you move QUEUED case after successful try_cmpxchg(), it's always calling
> dequeue before breaking out of the switch block. Might as well do it in a
> single spot right after the switch block?

Makes sense. I'll move this after the switch block.

Thanks!
-Andrea