Re: [PATCH] sched_ext: Invalidate dispatch decisions on CPU affinity changes

From: Andrea Righi

Date: Wed Feb 04 2026 - 13:01:15 EST


On Wed, Feb 04, 2026 at 04:58:47PM +0000, Kuba Piecuch wrote:
> On Wed Feb 4, 2026 at 3:36 PM UTC, Andrea Righi wrote:
> >> >
> >> > When finish_dispatch() detects a qseq mismatch, the dispatch is dropped
> >> > and the task is returned to the SCX_OPSS_QUEUED state, allowing it to be
> >> > re-dispatched using up-to-date affinity information.
> >>
> >> How will the scheduler know that the dispatch was dropped? Is the scheduler
> >> expected to infer it from the ops.enqueue() that follows set_cpus_allowed_scx()
> >> on CPU1?
> >
> > The idea was that, if the dispatch is dropped, we'll see another
> > ops.enqueue() for the task, so at least the task is not "lost" and the
> > BPF scheduler gets another chance what to do with it. In this case it'd be
> > useful to set SCX_ENQ_REENQ (or a dedicated special flag) to indicate that
> > the enqueue resulted from a dropped dispatch.
>
> I think SCX_ENQ_REENQ is enough for now; we can always add a dedicated flag
> if a need for it arises.
>
> I still worry about the scenario you described. In particular, I think it can
> lead to tasks being forgotten (i.e. not re-enqueued) after a failed dispatch.
>
> CPU0                                        CPU1
> ----                                        ----
> if (cpumask_test_cpu(cpu, p->cpus_ptr))
>                                             task_rq_lock(p)
>                                             dequeue_task_scx(p, ...)
>                                               (remove p from internal queues)
>                                             set_cpus_allowed_scx(p, new_mask)
>                                             enqueue_task_scx(p, ...)
>                                               (add p to internal queues)
>                                             task_rq_unlock(p)
> (remove p from internal queues)
> scx_bpf_dsq_insert(p,
>                    SCX_DSQ_LOCAL_ON | cpu, 0)
>
> In this scenario, the ops.enqueue() which is supposed to notify the BPF
> scheduler about the failed dispatch actually happens _before_ the actual
> dispatch, so once the dispatch fails, the task won't be re-enqueued.
>
> There are two problems here:
>
> 1. CPU0 makes a scheduling decision based on stale data and it isn't detected.
> 2. Even if it is detected and the dispatch aborted, the task won't be
> re-enqueued.

Right. At this point I think we can just rely on the affinity validation
via task_can_run_on_remote_rq(), where p->cpus_ptr is always stable and
just drop invalid dispatches.

And to prevent dropped tasks, I was wondering if we could just insert the
task into a per-rq fallback DSQ, that can be consumed from balance_scx() to
re-enqueue the task (setting SCX_ENQ_REENQ). This should solve the
re-enqueue problem while avoiding the locking complexity of calling
ops.enqueue() directly from finish_dispatch().

Thoughts?

>
> The way we deal with the first problem in ghOSt (Google's equivalent of
> sched_ext) is we expose the per-task sequence number to the BPF scheduler.
> On the dispatch path, when the BPF scheduler has a candidate task,
> it retrieves its seqnum, re-checks the task state to ensure that it's still
> eligible for dispatch, and passes the seqnum to the kernel's dispatch helper
> for verification. If the kernel detects that the seqnum has changed already,
> it synchronously fails the dispatch attempt (dispatch always happens
> synchronously in ghOSt). In sched_ext, we could do the synchronous check, but
> we also need to do the same check later in finish_dispatch(), comparing
> the current qseq against the qseq passed by the BPF scheduler.
>
> To fix the second problem, we would need to explicitly call ops.enqueue()
> from finish_dispatch() and the other places where we abort dispatch if the
> qseq is out of date.
>
> Either that, or just add locking to the BPF scheduler to prevent the race from
> happening in the first place.

Thanks,
-Andrea