Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics
From: Christian Loehle
Date: Thu Feb 12 2026 - 09:32:17 EST
On 2/12/26 10:16, Andrea Righi wrote:
> On Wed, Feb 11, 2026 at 12:37:13PM -1000, Tejun Heo wrote:
>> Hello,
>>
>> On Wed, Feb 11, 2026 at 11:34:54PM +0100, Andrea Righi wrote:
>>>> The end result is about the same because whenever we migrate we're sending
>>>> it to the local DSQ of the destination CPU, so whether we generate the event
>>>> on deactivation of the source CPU or activation on the destination doesn't
>> make a *whole* lot of difference. However, conceptually, migrations are
>>>> internal events. There isn't anything actionable for the BPF scheduler. The
>>>> reason why ops.dequeue() should be emitted is not because the task is
>>>> changing CPUs (which caused the deactivation) but the fact that it ends up
>>>> in a local DSQ afterwards. I think it'll be cleaner both conceptually and
>>>> code-wise to emit ops.dequeue() only from dispatch_enqueue() and dequeue
>>>> paths.
>>>
>>> Does this include core scheduler migrations or just SCX-initiated
>>> migrations (move_remote_task_to_local_dsq())?
>>>
>>> Because with core scheduler migrations we trigger ops.enqueue(), so we
>>> should also trigger ops.dequeue(). Or we need to send the task straight to
>>> local to prevent calling ops.enqueue().
>>
>> I'm a bit lost. Can you elaborate on core scheduler migrations triggering
>> ops.enqueue()?
>
> Alright, let me elaborate on this a bit more with a (slightly) fresher brain.
>
> We have two main classes of migrations:
>
> 1) Internal SCX-initiated migrations: e.g.,
> dispatch_to_local_dsq() -> move_remote_task_to_local_dsq(), or
> consume_remote_task() -> move_remote_task_to_local_dsq(), these
> are completely internal to SCX and shouldn't trigger
> ops.dequeue/enqueue()
>
> 2) Core scheduler migrations
> - CPU affinity: sched_setaffinity, cpuset/cgroup mask change, etc.
> affine_move_task() -> move_queued_task() migrates it -> we trigger
> ops.dequeue(SCX_DEQ_SCHED_CHANGE) on the source and ops.enqueue() on
> the target.
>
> - Core scheduling (CONFIG_SCHED_CORE): two different cases:
> - Migration (task moved between runqueues via move_queued_task_locked()
> to satisfy core cookie)
>
> - NUMA balancing: migrate_task_to() can move an SCX task to another CPU
>
> - CPU hotplug: on CPU down, runnable tasks are pushed off via
> __balance_push_cpu_stop() -> __migrate_task()
>
> If we want to skip ops.dequeue() only for internal SCX migrations (and
> maybe also for NUMA and hotplug?), then only checking
> task_on_rq_migrating(p) is not enough, because that's true for every
> migration listed above and we'd skip all of them.
>
> So, we need a way to mark "this migration is internal to SCX", like a new
> SCX_TASK_MIGRATING_INTERNAL flag?
>
> The alternative is to always trigger ops.dequeue/enqueue() on every
> migration (no flag): even for internal SCX migrations the BPF scheduler
> could use it to track task movements, though there's nothing it can do.
> That way we don't need the additional flag.
>
> Does one of these directions fit better with what you have in mind?
IIUC one example might sway your opinion (or not):
Note that not receiving an ops.dequeue() for tasks leaving one LOCAL_DSQ
(and maybe being enqueued at another) prevents e.g. accurate PELT load
tracking on the BPF side.
Regular utilization tracking works through ops.running() and
ops.stopping(), but I don't think load can be implemented accurately.