Re: [PATCH v2 sched_ext/for-7.1] sched_ext: Invalidate dispatch decisions on CPU affinity changes

From: Cheng-Yang Chou

Date: Fri May 01 2026 - 12:20:26 EST

Hi Kuba,

On Mon, Apr 27, 2026 at 09:06:21AM +0000, Kuba Piecuch wrote:
> > On Thu, Apr 23, 2026 at 01:32:20PM +0000, Kuba Piecuch wrote:
> >> > On Mon, Mar 23, 2026 at 01:13:20PM -1000, Tejun Heo wrote:
> >> >> > The simple way to do this is to do scx_bpf_dsq_insert() at the very beginning,
> >> >> > once we know which task we would like to dispatch, and cancel the pending
> >> >> > dispatch via scx_bpf_dispatch_cancel() if any of the pre-dispatch checks fail
> >> >> > on the BPF side. This way, the "critical section" includes BPF-side checks, and
> >> >> > SCX will ignore the dispatch if there was a dequeue/enqueue racing with the
> >> >> > critical section.
> >> >> >
> >> >> > With this solution, we can throw an error if task_can_run_on_remote_rq() is
> >> >> > false, because we know that there was no racing cpumask change (if there was,
> >> >> > it would have been caught earlier, in finish_dispatch()).
> >> >>
> >> >> Yeah, I think this makes more sense. qseq is already there to provide
> >> >> protection against these events. It's just that the capturing of qseq is too
> >> >> late. If insert/cancel is too ugly, we can introduce another kfunc to
> >> >> capture the qseq - scx_bpf_dsq_insert_begin() or something like that - and
> >> >> stash it in a per-cpu variable. That way, qseq would be cover the "current"
> >> >> queued instance and the existing qseq mechanism would be able to reliably
> >> >> ignore the ones that lost race to dequeue.
> >> >
> >> > Since this has been stale for a while, I prepared a patch to implement
> >> > scx_bpf_dsq_insert_begin() as suggested.
> >>
> >> Thanks for creating the patch. A couple of thoughts:
> >>
> >> 1. Do we have a use case that requires dsq_insert_begin() that isn't
> >> satisfied using the "insert and then cancel if needed" approach?
> >
> > IIUC, yes. scx_bpf_dispatch_cancel() is only registered in
> > scx_kfunc_ids_dispatch, so it is only callable from ops.dispatch().
> > dsq_insert_begin(), on the other hand, is available from both
> > ops.enqueue() and ops.dispatch() (SCX_KF_ENQUEUE | SCX_KF_DISPATCH).
> > Since there is nothing to cancel in ops.enqueue(), the insert-and-cancel
> > approach simply doesn't work there.
>
> Wouldn't the natural thing then be to extend scx_bpf_dispatch_cancel() to
> work for direct dispatch? Instead of introducing a whole new mechanism, let's
> extend the one we have by functionality that it (arguably) should have had
> from the beginning.

I see. You're right that dispatch_cancel() could be extended to work in
the enqueue context.

I'm happy to go either direction, your approach or Tejun's suggestion.
Tejun, Andrea, sched-ext folks, thoughts?

>
> >
> >>
> >> 2. Do we want to restrict ourselves through the one qseq slot provided by
> >> dsq_insert_begin()? The most flexible approach IMO would be to simply
> >> allow BPF to read the qseq directly via a kfunc and then supply it to
> >> dsq_insert() later. With this, we can have multiple qseqs saved at the
> >> same time, and we can even pass them between CPUs, e.g. if one CPU
> >> dequeues a task for a sibling CPU, but we want the checks to be made inside
> >> the sibling's ops.dispatch() (I just made this use case it up, it may not
> >> be practical.)
> >> That said, exposing an internal thing like qseq to BPF may be a step too far.
> >
> > In Tejun's reply back in [1], he suggested dsq_insert_begin() precisely
> > to avoid promoting qseq into the BPF ABI — which matches your own concern.
> > The single per-CPU slot is sufficient for the one-task-per-iteration
> > dispatch loops used by existing schedulers (e.g., scx_central).
> > If a concrete cross-CPU use case materializes later, we can always extend
> > dsq_insert() to accept an explicit qseq without breaking the current,
> > simpler path.
> >
> > [1]: https://lore.kernel.org/all/acHJED4iAeytdC2l@xxxxxxxxxxxxxxx/
> >
>
> Well, Tejun doesn't explicitly say there that he's against exposing qseq, but
> I won't be surprised if he is.
>
> FWIW, ghOSt (our Google-internal BPF scheduling solution) uses exactly this
> approach to guard the dispatch path against racing dequeues/enqueues.
> Every task has a seqnum that gets incremented on each "event" pertaining to
> the task. In the dispatch path, the BPF scheduler reads the task seqnum,
> does whatever checks it needs to do, and passes the seqnum to ghOSt at the end.
>
> Admittedly, what works downstream doesn't have to work upstream, but I still
> wanted to provide this data point :-)

The ghOSt data point is appreciated. If a concrete use case emerges where
the single-slot approach falls short, extending dsq_insert() to accept an
explicit qseq seems like a natural next step.

Tejun, Andrea, sched-ext folks, any preferences?

--
Cheers,
Cheng-Yang