Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
From: Andrea Righi
Date: Thu Feb 12 2026 - 10:46:10 EST
On Thu, Feb 12, 2026 at 02:32:02PM +0000, Christian Loehle wrote:
> On 2/12/26 10:16, Andrea Righi wrote:
> > On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote:
> >> Hello,
> >>
> >> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
> >>>> The end result is about the same because whenever we migrate we're sending
> >>>> it to the local DSQ of the destination CPU, so whether we generate the event
> >>>> on deactivation of the source CPU or activation on the destination doesn't
> >>>> make *whole* lot of difference. However, conceptually, migrations are
> >>>> internal events. There isn't anything actionable for the BPF scheduler. The
> >>>> reason why ops.dequeue() should be emitted is not because the task is
> >>>> changing CPUs (which caused the deactivation) but the fact that it ends up
> >>>> in a local DSQ afterwards. I think it'll be cleaner both conceptually and
> >>>> code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
> >>>> paths.
> >>>
> >>> Does this include core scheduler migrations or just SCX-initiated
> >>> migrations (move_remote_task_to_local_dsq())?
> >>>
> >>> Because with core scheduler migrations we trigger ops.enqueue(), so we
> >>> should also trigger ops.dequeue(). Or we need to send the task straight to
> >>> local to prevent calling ops.enqueue().
> >>
> >> I'm a bit lost. Can you elaborate on core scheduler migrations triggering
> >> ops.enqueue()?
> >
> > Alright, let me re-elaborate more on this with a (slightly) fresher brain.
> >
> > We have two main classes of migrations:
> >
> > 1) Internal SCX-initiated migrations: e.g.,
> > dispatch_to_local_dsq() -> move_remote_task_to_local_dsq(), or
> > consume_remote_task() -> move_remote_task_to_local_dsq(), these
> > are completely internal to SCX and shouldn't trigger
> > ops.dequeue/enqueue()
> >
> > 2) Core scheduler migrations
> > - CPU affinity: sched_setaffinity, cpuset/cgroup mask change, etc.
> > affine_move_task -> move_queued_task migrates it -> we trigger
> > ops.dequeue(SCX_DEQ_SCHED_CHANGE) on the source and ops.enqueue() on
> > the target.
> >
> > - Core scheduling (CONFIG_SCHED_CORE): two different cases:
> > - Migration (task moved between runqueues via move_queued_task_locked()
> > to satisfy core cookie)
> >
> > - NUMA balancing: migrate_task_to() can move an SCX task to another CPU
> >
> > - CPU hotplug: on CPU down, runnable tasks are pushed off via
> > __balance_push_cpu_stop() -> __migrate_task()
> >
> > If we want to skip ops.dequeue() only for internal SCX migrations (and
> > maybe also for NUMA and hotplug?), then only checking
> > task_on_rq_migrating(p) is not enough, because that's true for every
> > migration listed above and we'd skip all of them.
> >
> > So, we need a way to mark "this migration is internal to SCX", like a new
> > SCX_TASK_MIGRATING_INTERNAL flag?
> >
> > The alternative is to always trigger ops.dequeue/enqueue() on every
> > migration (no flag): even for internal SCX migrations the BPF scheduler
> > could use it to track task movements, though there's nothing it can do.
> > That way we don't need the additional flag.
> >
> > Does one of these directions fit better with what you have in mind?
> IIUC one example might sway your opinion (or not):
> Note that not receiving a ops.dequeue() for tasks leaving one LOCAL_DSQ
> (and maybe being enqueued at another) prevents e.g. accurate PELT load
> tracking on the BPF side.
> Regular utilization tracking works through ops.running() and
> ops.stopping(), but I don't think load can be implemented accurately.
That makes sense to me, and I think it's actually a valid reason to prefer
the "always trigger" approach.
We have DSQs, and the BPF scheduler can potentially maintain its own
queues, but to implement accurate PELT (runnable contribution to a
runqueue, possibly with decay), we'd also need to know exactly when a task
leaves one runqueue and joins another.
Essentially, we could get the full task lifecycle in BPF:
- runnable lifecycle:
  - ops.dequeue(): task leaves its runqueue, source CPU = scx_bpf_task_cpu(p)
  - ops.enqueue(): task wants to run, current CPU = scx_bpf_task_cpu(p)
- running lifecycle:
  - ops.running(p): task starts running on scx_bpf_task_cpu(p)
  - ops.stopping(p): task stops running on scx_bpf_task_cpu(p)
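To make the load-tracking argument concrete, here's a rough userspace
sketch (not kernel or BPF code; struct and function names like
rq_load_enqueue() are made up for illustration, and the halving-based
decay is only a crude stand-in for PELT's geometric series) of the kind
of per-runqueue runnable accounting a BPF scheduler could drive from
ops.enqueue()/ops.dequeue():

```c
#include <stdint.h>

/* Half-life of the decayed runnable sum, in accounting periods
 * (PELT's half-life is ~32 ms; the unit here is arbitrary). */
#define HALF_LIFE_PERIODS 32

struct rq_load {
	uint64_t runnable_sum;	/* decayed sum of runnable task weight */
	uint64_t last_update;	/* time of last decay, in periods */
	uint32_t nr_runnable;	/* tasks currently runnable on this rq */
};

/* Age the sum: halve it once per elapsed half-life (crude stand-in
 * for PELT's per-period y^n decay). */
static void rq_load_decay(struct rq_load *rq, uint64_t now)
{
	uint64_t elapsed = now - rq->last_update;

	while (elapsed >= HALF_LIFE_PERIODS) {
		rq->runnable_sum >>= 1;
		elapsed -= HALF_LIFE_PERIODS;
	}
	rq->last_update = now;
}

/* What ops.enqueue() would do: the task joins this CPU's runqueue. */
static void rq_load_enqueue(struct rq_load *rq, uint64_t weight,
			    uint64_t now)
{
	rq_load_decay(rq, now);
	rq->runnable_sum += weight;
	rq->nr_runnable++;
}

/* What ops.dequeue() would do: the task leaves this CPU's runqueue.
 * Without a dequeue event on migration, this subtraction never runs
 * on the source CPU, so its load stays inflated. */
static void rq_load_dequeue(struct rq_load *rq, uint64_t weight,
			    uint64_t now)
{
	rq_load_decay(rq, now);
	rq->runnable_sum -= weight < rq->runnable_sum ?
			    weight : rq->runnable_sum;
	rq->nr_runnable--;
}
```

The point is that utilization (running time) can be derived from
ops.running()/ops.stopping() alone, but the runnable_sum above is only
correct if every runqueue departure, including migrations, produces a
dequeue event on the source CPU.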
A potential concern is the extra overhead of the additional callbacks, but
I don't think it matters much, especially since schedulers that don't
implement ops.dequeue() effectively pay no cost for these events.
Thanks,
-Andrea