Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

From: Andrea Righi

Date: Mon Feb 02 2026 - 10:38:46 EST


On Mon, Feb 02, 2026 at 10:02:30AM +0000, Christian Loehle wrote:
> On 2/2/26 09:26, Andrea Righi wrote:
> > On Mon, Feb 02, 2026 at 08:45:18AM +0100, Andrea Righi wrote:
> > ...
> >>> So I have finally gotten around updating scx_storm to the new semantics,
> >>> see:
> >>> https://github.com/cloehle/scx/tree/cloehle/scx-storm-qmap-insert-local-dequeue-semantics
> >>>
> >>> I don't think the new ops.dequeue() are enough to make inserts to local-on
> >>> from anywhere safe, because it's still racing with dequeue from another CPU?
> >>
> >> Yeah, with this patch set BPF schedulers get proper ops.dequeue()
> >> callbacks, but we're not fixing the usage of SCX_DSQ_LOCAL_ON from
> >> ops.dispatch().
> >>
> >> When task properties change between scx_bpf_dsq_insert() and the actual
> >> dispatch, task_can_run_on_remote_rq() can still trigger a fatal
> >> scx_error().
> >>
> >> The ops.dequeue(SCX_DEQ_SCHED_CHANGE) notification happens after the
> >> property change, so it can't prevent already-queued dispatches from
> >> failing. The race window is between ops.dispatch() returning and
> >> dispatch_to_local_dsq() executing.
> >>
> >> We can address this in a separate patch set. One thing at a time. :)
> >
> > Thinking more on this, the problem is that we're passing enforce=true to
> > task_can_run_on_remote_rq(), triggering a critical failure - scx_error().
> > There's logic in task_can_run_on_remote_rq() to fall back to the global
> > DSQ, but that doesn't happen if we pass enforce=true, due to scx_error().
> >
> > However, instead of the global DSQ fallback, I was wondering if it'd be
> > better to simply re-enqueue the task - setting SCX_ENQ_REENQ - if the
> > target local DSQ isn't valid anymore when the dispatch is finalized.
> >
> > In this way using SCX_DSQ_LOCAL_ON | cpu from ops.dispatch() would simply
> > trigger a re-enqueue when "cpu" isn't valid anymore (due to concurrent
> > affinity / migration disabled changes) and the BPF scheduler can handle
> > that in another ops.enqueue().
> >
> > What do you think?
>
> I think that's a lot more versatile for the BPF scheduler than using the
> global DSQ as fallback in that case, so yeah I'm all for it!
>

Ack, I already have a working patch to do this, I'll post it as a separate
patch set.
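
For illustration only, the re-enqueue idea discussed above can be modelled
in user space roughly as follows. This is not the actual patch and not
kernel code: every name here (toy_task, toy_can_run_on(),
toy_finalize_dispatch(), the TOY_* values) is hypothetical, and the model
assumes at most 64 CPUs so that affinity fits in one bitmask. It only shows
the decision made when the dispatch is finalized: if the target CPU is no
longer valid for the task, hand the task back for another enqueue instead
of raising a fatal error.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct toy_task {
	uint64_t allowed_cpus;    /* bit N set => task may run on CPU N */
	bool migration_disabled;  /* task is pinned to cur_cpu while set */
	int cur_cpu;              /* CPU the task is currently associated with */
};

enum toy_verdict {
	TOY_DISPATCH_LOCAL,       /* insert into the target CPU's local DSQ */
	TOY_REENQUEUE,            /* give the task back to the enqueue path */
};

/* Check whether @cpu is still a valid target for @p. */
bool toy_can_run_on(const struct toy_task *p, int cpu)
{
	if (cpu < 0 || cpu >= 64)
		return false;
	if (p->migration_disabled && cpu != p->cur_cpu)
		return false;
	return (p->allowed_cpus >> cpu) & 1;
}

/*
 * Decide what to do when a queued "local-on" dispatch is finalized.
 * The task's properties may have changed since the insert was queued,
 * so the check is repeated here and a stale target becomes a re-enqueue
 * rather than an error.
 */
enum toy_verdict toy_finalize_dispatch(const struct toy_task *p, int target_cpu)
{
	return toy_can_run_on(p, target_cpu) ? TOY_DISPATCH_LOCAL
					     : TOY_REENQUEUE;
}
```

In this model, a task whose affinity shrinks from CPUs {0,1} to {0} after
being queued for CPU 1 yields TOY_REENQUEUE, which corresponds to the BPF
scheduler seeing the task again in ops.enqueue() and picking a new target.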

Thanks,
-Andrea