Re: [PATCH v2 sched_ext/for-7.1] sched_ext: Invalidate dispatch decisions on CPU affinity changes

From: Kuba Piecuch

Date: Fri Mar 20 2026 - 05:18:26 EST


On Thu Mar 19, 2026 at 9:09 PM UTC, Andrea Righi wrote:
> On Thu, Mar 19, 2026 at 10:31:30AM +0000, Kuba Piecuch wrote:
>> On Thu Mar 19, 2026 at 8:35 AM UTC, Andrea Righi wrote:
>> > @@ -2537,9 +2546,26 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
>> > }
>> >
>> > if (src_rq != dst_rq &&
>> > - unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
>> > - dispatch_enqueue(sch, rq, find_global_dsq(sch, task_cpu(p)), p,
>> > - enq_flags | SCX_ENQ_CLEAR_OPSS | SCX_ENQ_GDSQ_FALLBACK);
>> > + unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, false))) {
>> > + /*
>> > + * Affinity changed after dispatch decision and the task
>> > + * can't run anymore on the destination rq.
>>
>> More of a nitpick, but this doesn't necessarily mean that the affinity changed.
>> The scheduler could have also issued an invalid dispatch to a CPU outside of
>> the task's cpumask (e.g. due to a bug), in which case the task won't be
>> re-enqueued if we simply drop the dispatch, correct?
>
> That's right, the scheduler could have issues an invalid dispatch and in
> that case we would just drop the task on the floor, which is not really
> nice, it'd be better to immediately error in this case. And we don't need
> the global DSQ fallback, since we're erroring anyway.
>
> I need to rethink this part...

The fundamental problem here is differentiating between buggy dispatches that
should never have been issued and dispatches that were valid at the moment
the BPF scheduler was preparing the task for dispatch, but became invalid due
to racing cpumask changes.

The crucial observation is that SCX will only detect racing dequeues/enqueues
if they happen within the window between scx_bpf_dsq_insert() and
finish_dispatch(). That's because scx_bpf_dsq_insert() stores a snapshot of
the task's current qseq value, which finish_dispatch() then compares against
the task's qseq at that point.
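
To make the window concrete, here's a toy userspace model of the snapshot and
compare steps. It's only a sketch: the model_* names are made up for
illustration and none of this is the actual kernel code, which keeps qseq
packed into the task's ops state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a task; only the sequence counter is modeled. */
struct model_task {
	uint64_t qseq;	/* bumped on every dequeue/enqueue cycle */
};

/* Hypothetical stand-in for a buffered dispatch. */
struct model_dispatch {
	struct model_task *task;
	uint64_t qseq_snapshot;	/* value of task->qseq at insert time */
};

/* Models a dequeue followed by a re-enqueue of the task. */
static void model_requeue(struct model_task *t)
{
	assert(t);
	t->qseq++;
}

/* Models scx_bpf_dsq_insert(): the race detection window opens here. */
static struct model_dispatch model_insert(struct model_task *t)
{
	assert(t);
	struct model_dispatch d = { .task = t, .qseq_snapshot = t->qseq };
	return d;
}

/*
 * Models the check in finish_dispatch(): the dispatch is only honored if
 * no dequeue/enqueue happened since the snapshot was taken.
 */
static bool model_finish(const struct model_dispatch *d)
{
	assert(d && d->task);
	return d->task->qseq == d->qseq_snapshot;
}
```

A model_requeue() that lands before model_insert() is invisible to
model_finish(), while one that lands after it makes model_finish() return
false.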

The BPF-side cpumask checks traditionally happen outside of this window, making
finish_dispatch() incapable of detecting cpumask changes that raced with the
BPF-side check but happened strictly before scx_bpf_dsq_insert().

To resolve this, we need to extend the race detection window so that it
includes the BPF-side checks.

The simple way to do this is to call scx_bpf_dsq_insert() at the very
beginning, once we know which task we would like to dispatch, and cancel the
pending dispatch via scx_bpf_dispatch_cancel() if any of the pre-dispatch
checks fail on the BPF side. This way, the "critical section" includes the
BPF-side checks, and SCX will ignore the dispatch if there was a
dequeue/enqueue racing with the critical section.
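
Here's a toy userspace model comparing the two orderings. Again, this is just
a sketch under assumptions: all model_* names are hypothetical, the cpumask is
a plain 64-bit bitmap, and an affinity change is assumed to always go through
a dequeue/enqueue cycle that bumps qseq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_task {
	uint64_t qseq;		/* bumped on every dequeue/enqueue cycle */
	uint64_t cpumask;	/* bit N set: task may run on CPU N */
};

enum model_result {
	MODEL_DISPATCHED,	/* dispatch went through */
	MODEL_CANCELLED,	/* BPF-side check failed, dispatch cancelled */
	MODEL_IGNORED,		/* race detected via qseq, dispatch dropped */
	MODEL_INVALID,		/* undetected bad target CPU: looks like a bug */
};

/* Models an affinity change (assumed to dequeue and re-enqueue the task). */
static void model_set_cpumask(struct model_task *t, uint64_t mask)
{
	assert(t);
	t->cpumask = mask;
	t->qseq++;
}

static bool model_allowed(const struct model_task *t, int cpu)
{
	return (t->cpumask & (1ULL << cpu)) != 0;
}

/* Models the finish_dispatch() side, given the qseq snapshot from insert. */
static enum model_result model_finish(const struct model_task *t, int cpu,
				      uint64_t snap)
{
	if (snap != t->qseq)
		return MODEL_IGNORED;
	return model_allowed(t, cpu) ? MODEL_DISPATCHED : MODEL_INVALID;
}

/* Traditional ordering: BPF-side check first, insert (snapshot) afterwards. */
static enum model_result model_check_then_insert(struct model_task *t, int cpu,
						 bool race, uint64_t new_mask)
{
	if (!model_allowed(t, cpu))
		return MODEL_CANCELLED;
	if (race)	/* affinity change lands after the check, before insert */
		model_set_cpumask(t, new_mask);
	uint64_t snap = t->qseq;	/* window opens too late */
	return model_finish(t, cpu, snap);
}

/* Proposed ordering: insert (snapshot) first, check inside the window. */
static enum model_result model_insert_then_check(struct model_task *t, int cpu,
						 bool race, uint64_t new_mask)
{
	uint64_t snap = t->qseq;	/* window opens before the check */
	bool ok = model_allowed(t, cpu);
	if (race)	/* same race as above, now inside the window */
		model_set_cpumask(t, new_mask);
	if (!ok)
		return MODEL_CANCELLED;	/* models scx_bpf_dispatch_cancel() */
	return model_finish(t, cpu, snap);
}
```

With a task allowed on CPUs 0-1, a target of CPU 1, and a racing change to
CPU 0 only, the first variant ends in MODEL_INVALID (indistinguishable from a
scheduler bug), while the second ends in MODEL_IGNORED.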

With this solution, we can throw an error if task_can_run_on_remote_rq()
returns false, because we know that there was no racing cpumask change (if
there was, it would have been caught earlier, in finish_dispatch()).