Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

In reply to: Andrea Righi: "Re: [PATCH 1/2] sched_ext: Fix ops.dequeue() semantics"

From: Tejun Heo

Date: Thu Feb 12 2026 - 13:35:59 EST

Hello, Andrea.

On Thu, Feb 12, 2026 at 07:14:13PM +0100, Andrea Righi wrote:
...
> In ops.enqueue() the BPF scheduler doesn't necessarily pick a target CPU:
> it can put the task on an arbitrary DSQ or even in some internal BPF data
> structures. The task is still associated with a runqueue, but only to
> satisfy a kernel requirement, for sched_ext that association isn't
> meaningful, because the task isn't really "on" that CPU (in fact in
> ops.dispatch() can do the "last minute" migration).

Yes.

> Therefore, keeping accurate per-CPU information from the kernel's
> perspective doesn't buy us much, given that the BPF scheduler can keep
> tasks in its own queues or structures.
>
> Accurate PELT is still doable: the BPF scheduler can track where it puts
> each task in its own state, updates runnable load when it places the task
> in a DSQ / data structure and when the task leaves (dequeue). And it can
> use ops.running() / ops.stopping() for utilization.

And the BPF sched might choose to do load aggregation at a different level
too - e.g. maybe a per-CPU load metric doesn't make sense given the machine
and scheduler, and only per-LLC aggregation would be meaningful, which
would be true for several of the current SCX schedulers given their per-LLC
DSQ usage.
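
To make the per-LLC aggregation idea concrete, here is a minimal userspace
sketch of the bookkeeping. It is not sched_ext code: the topology constants,
struct names and the account_enqueue() / account_dequeue() helpers are all
hypothetical, and it keeps a plain sum of weights rather than a decaying
PELT-style average. It only shows that load can follow the scheduler's own
placement decisions, independent of which rq the task is on.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical topology: 8 CPUs, 4 CPUs sharing each LLC. */
#define NR_CPUS		8
#define CPUS_PER_LLC	4
#define NR_LLCS		(NR_CPUS / CPUS_PER_LLC)

/* Per-LLC runnable load, owned by the (hypothetical) scheduler. */
struct llc_load {
	uint64_t runnable_weight;	/* sum of weights of queued tasks */
	uint32_t nr_queued;		/* number of queued tasks */
};

/* Per-task state the scheduler keeps to remember its own placement. */
struct task_state {
	uint32_t weight;
	int llc;			/* LLC the task is accounted to, -1 if none */
};

static struct llc_load llc_loads[NR_LLCS];

static int cpu_to_llc(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu / CPUS_PER_LLC;
}

/* Would run where the scheduler queues the task, e.g. on a per-LLC DSQ. */
static void account_enqueue(struct task_state *t, int cpu)
{
	int llc = cpu_to_llc(cpu);

	assert(t->llc < 0);		/* must not already be accounted */
	t->llc = llc;
	llc_loads[llc].runnable_weight += t->weight;
	llc_loads[llc].nr_queued++;
}

/* Would run where the scheduler is told the task left its queues. */
static void account_dequeue(struct task_state *t)
{
	if (t->llc < 0)
		return;			/* nothing accounted, nothing to undo */

	llc_loads[t->llc].runnable_weight -= t->weight;
	llc_loads[t->llc].nr_queued--;
	t->llc = -1;
}
```

The accounting is keyed on the LLC recorded in the task's own state, so an
internal rq-to-rq migration that doesn't go through the dequeue path leaves
the totals untouched.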

> And with a proper ops.dequeue() semantics, PELT can be driven by the BPF
> scheduler's own placement and the scx callbacks, not by the specific rq a
> task is on.
>
> If all of the above makes sense for everyone, I agree that we don't need to
> notify all the internal migrations.

Yeah, I think we're on the same page. BTW, I wonder whether we could use
p->scx.sticky_cpu to detect internal migrations. It's only used for internal
migrations, so maybe it can be used for detection.

Thanks.

--
tejun