Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

From: Andrea Righi

Date: Sat Feb 14 2026 - 05:17:22 EST

On Thu, Feb 12, 2026 at 11:30:14PM +0100, Andrea Righi wrote:
> On Thu, Feb 12, 2026 at 08:35:55AM -1000, Tejun Heo wrote:
> > Hello, Andrea.
> >
> > On Thu, Feb 12, 2026 at 07:14:13PM +0100, Andrea Righi wrote:
> > ...
> > > In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU:
> > > it can put the task on an arbitrary DSQ or even in some internal BPF data
> > > structures. The task is still associated with a runqueue, but only to
> > > satisfy a kernel requirement. For sched_ext that association isn't
> > > meaningful, because the task isn't really "on" that CPU (in fact
> > > ops.dispatch() can do the "last minute" migration).
> >
> > Yes.
> >
> > > Therefore, keeping accurate per-CPU information from the kernel's
> > > perspective doesn't buy us much, given that the BPF scheduler can keep
> > > tasks in its own queues or structures.
> > >
> > > Accurate PELT is still doable: the BPF scheduler can track where it puts
> > > each task in its own state, and update runnable load when it places the
> > > task in a DSQ / data structure and when the task leaves (dequeue). And it
> > > can use ops.running() / ops.stopping() for utilization.
> >
> > And the BPF sched might choose to do load aggregation at a different level
> > too - e.g. maybe per-CPU load metric doesn't make sense given the machine
> > and scheduler and only per-LLC level aggregation would be meaningful, which
> > would be true for several of the current SCX schedulers given the per-LLC
> > DSQ usage.
> >
> > > And with proper ops.dequeue() semantics, PELT can be driven by the BPF
> > > scheduler's own placement and the scx callbacks, not by the specific rq a
> > > task is on.
> > >
> > > If all of the above makes sense for everyone, I agree that we don't need to
> > > notify all the internal migrations.
> >
> > Yeah, I think we're on the same page. BTW, I wonder whether we could use
> > p->scx.sticky_cpu to detect internal migrations. It's only used for internal
> > migrations, so maybe it can be used for detection.
>
> Perfect. And yes, I think if we set p->scx.sticky_cpu before
> deactivate_task() in move_remote_task_to_local_dsq(), then in ops_dequeue()
> we should be able to catch internal migrations by checking
> task_on_rq_migrating(p) && p->scx.sticky_cpu >= 0.
>
> I'll run some tests with that.

I ran more tests and I don't think we can simply rely on p->scx.sticky_cpu.

In particular, I don't see how to handle this scenario using only
p->scx.sticky_cpu: a task starts an internal migration, a sched_change
occurs, and ops.dequeue() gets skipped because p->scx.sticky_cpu >= 0.

So I'm back to the idea of introducing an SCX_TASK_MIGRATING_INTERNAL
flag...
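
To make the scenario above concrete, here is a toy userspace model, not
kernel code. Every name in it (struct model_task,
MODEL_MIGRATING_INTERNAL, the two dequeue_*() helpers, replay_scenario())
is a hypothetical stand-in. It also assumes that a dedicated flag would
cover only the dequeue performed by the internal migration itself, while
sticky_cpu stays set until the task is enqueued again; that is an
assumption of the sketch, not something settled in this thread.

```c
/* Toy userspace model, NOT kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the proposed SCX_TASK_MIGRATING_INTERNAL flag. */
#define MODEL_MIGRATING_INTERNAL 0x1u

struct model_task {
	int sticky_cpu;      /* -1 when no internal migration is pending */
	unsigned int flags;
	int dequeue_calls;   /* notifications delivered to the BPF side */
};

/* Variant A: skip the notification whenever sticky_cpu is set. */
static void dequeue_sticky_check(struct model_task *p)
{
	if (p->sticky_cpu >= 0)
		return;
	p->dequeue_calls++;
}

/* Variant B: skip only while the dedicated flag is set. */
static void dequeue_flag_check(struct model_task *p)
{
	if (p->flags & MODEL_MIGRATING_INTERNAL)
		return;
	p->dequeue_calls++;
}

typedef void (*dequeue_fn)(struct model_task *);

/*
 * Replay the sequence: an internal migration starts, then a
 * sched_change-style dequeue hits the task before sticky_cpu is cleared.
 * Returns how many dequeue notifications were delivered.
 */
static int replay_scenario(dequeue_fn dequeue)
{
	struct model_task p = { .sticky_cpu = -1 };

	assert(dequeue);

	/* 1. Internal migration: mark the task, dequeue from the source. */
	p.sticky_cpu = 3;
	p.flags |= MODEL_MIGRATING_INTERNAL;
	dequeue(&p);
	/* Assumption: the flag is dropped right after this dequeue. */
	p.flags &= ~MODEL_MIGRATING_INTERNAL;

	/* 2. sched_change dequeue while sticky_cpu is still >= 0. */
	dequeue(&p);

	return p.dequeue_calls;
}
```

In this model, replay_scenario(dequeue_sticky_check) delivers no
notification at all, so the sched_change dequeue is lost, while
replay_scenario(dequeue_flag_check) delivers exactly one.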

-Andrea