Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

From: Andrea Righi

Date: Thu Feb 12 2026 - 13:14:51 EST


On Thu, Feb 12, 2026 at 07:07:05AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Feb 12, 2026 at 04:45:43PM +0100, Andrea Righi wrote:
> > > > So, we need a way to mark "this migration is internal to SCX", like a new
> > > > SCX_TASK_MIGRATING_INTERNAL flag?
>
> Yeah, I think this is what we should do. That's the only ops.dequeue()
> without matching ops.enqueue(), right?

Correct.

>
> ...
> > > IIUC one example might sway your opinion (or not):
> > > Note that not receiving a ops.dequeue() for tasks leaving one LOCAL_DSQ
> > > (and maybe being enqueued at another) prevents e.g. accurate PELT load
> > > tracking on the BPF side.
> > > Regular utilization tracking works through ops.running() and
> > > ops.stopping(), but I don't think load can be implemented accurately.
> >
> > It makes sense to me and I think it's actually valid reason to prefer the
> > "always trigger" way.
>
> I don't think this is a valid argument. PELT is done that way because the
> association of the task and the CPU is meaningful for in-kernel schedulers.
> The queues are actually per-CPU. For SCX scheds, the relationship is not
> known to the kernel. Only the BPF scheduler itself knows, if it wants to
> attribute per-task load to a specific CPU, which CPU it should be attributed
> to. What's the point of following in-kernel association for PELT if the task
> was going to be hot migrated to another CPU on execution?

I see, let me elaborate more on this to make sure we're on the same page.

In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU:
it can put the task on an arbitrary DSQ or even in some internal BPF data
structure. The task is still associated with a runqueue, but only to
satisfy a kernel requirement; for sched_ext that association isn't
meaningful, because the task isn't really "on" that CPU (in fact,
ops.dispatch() can still do a "last minute" migration).
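To illustrate what I mean, here's a rough pseudocode sketch (not from the
patch; the helpers and the DSQ id are made up) of an ops.enqueue() that
never commits to a CPU:

```
/* Pseudocode sketch: the scheduler may either insert the task into a
 * shared DSQ or stash it in its own BPF map; either way, no target CPU
 * is chosen at enqueue time. */
void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (should_queue_internally(p))		/* hypothetical helper */
		stash_in_bpf_map(p);		/* hypothetical helper */
	else
		scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);

	/* The CPU the task runs on is only decided later, when
	 * ops.dispatch() (or a dispatch from the map) moves it to a
	 * local DSQ. */
}
```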

Therefore, keeping accurate per-CPU information from the kernel's
perspective doesn't buy us much, given that the BPF scheduler can keep
tasks in its own queues or structures.

Accurate PELT is still doable: the BPF scheduler can track where it puts
each task in its own state, update runnable load when it places the task
in a DSQ / data structure and when the task leaves (dequeue), and use
ops.running() / ops.stopping() for utilization.
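For what it's worth, the decay math itself is trivial to carry on the BPF
side. A minimal userspace sketch of the PELT-style geometric series (using
floating point for clarity; the kernel uses fixed-point, and the runnable
signal here would come from the enqueue/dequeue tracking above):

```c
#include <assert.h>

/* Decay factor y chosen so that y^32 == 0.5, i.e. the contribution of a
 * period halves after 32 periods, matching PELT's half-life. */
#define PELT_Y 0.97857206

/* Advance the load sum by one ~1024us PELT period: decay the running sum
 * and add this period's contribution (1024 if the task was runnable for
 * the whole period, 0 otherwise). */
static double pelt_step(double sum, int runnable)
{
	return sum * PELT_Y + (runnable ? 1024.0 : 0.0);
}
```

The BPF scheduler would drive pelt_step() from its own placement events
(task inserted into a DSQ / internal structure -> runnable, dequeued ->
not runnable), independent of which rq the kernel happens to have the
task on.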

And with proper ops.dequeue() semantics, PELT can be driven by the BPF
scheduler's own placement and the scx callbacks, not by the specific rq a
task is on.

If all of the above makes sense for everyone, I agree that we don't need to
notify all the internal migrations.

Thanks,
-Andrea