[PATCHSET v9] sched_ext: Fix ops.dequeue() semantics
From: Andrea Righi
Date: Sun Feb 15 2026 - 14:20:01 EST
The callback ops.dequeue() is provided to let BPF schedulers observe when a
task leaves the scheduler, either because it is dispatched or due to a task
property change. However, this callback is currently unreliable and not
invoked systematically, which can result in missed ops.dequeue() events.
In particular, once a task is removed from the scheduler (whether for
dispatch or due to a property change), the BPF scheduler loses visibility
of the task, and the sched_ext core may not always trigger ops.dequeue().
This breaks accurate accounting (e.g., per-DSQ queued runtime sums) and
prevents reliable tracking of task lifecycle transitions.
This patch set fixes the semantics of ops.dequeue() by guaranteeing that
each task entering the BPF scheduler's custody triggers exactly one
ops.dequeue() call when it leaves that custody, whether the exit is due to
a dispatch (regular or via a core scheduling pick) or to a scheduling
property change (e.g., sched_setaffinity(), sched_setscheduler(),
set_user_nice(), NUMA balancing, etc.).
To identify property-change dequeues, a new ops.dequeue() flag is
introduced: %SCX_DEQ_SCHED_CHANGE.
Together, these changes allow BPF schedulers to reliably track task
ownership and maintain accurate accounting.
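The intended contract can be illustrated with a small user-space model. This
is only a sketch under stated assumptions, not the kernel implementation:
every toy_* name, the TOY_DEQ_SCHED_CHANGE value and the in_custody field are
hypothetical stand-ins for the concepts described above. It shows a task
entering custody on enqueue, exactly one dequeue notification when it leaves
(tagged as either a dispatch or a property change), and a queued runtime sum
that stays balanced as a result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the property-change dequeue flag. The real
 * SCX_DEQ_SCHED_CHANGE value is defined by the kernel and is not
 * reproduced here.
 */
#define TOY_DEQ_SCHED_CHANGE	(1ULL << 0)

struct toy_task {
	uint64_t runtime;	/* contribution to the queued runtime sum */
	bool in_custody;	/* true while the toy scheduler owns the task */
};

struct toy_sched {
	uint64_t queued_runtime;	/* sum over tasks currently in custody */
	unsigned int nr_dispatch_deq;	/* dequeues caused by a dispatch */
	unsigned int nr_change_deq;	/* dequeues caused by a property change */
};

/* Models the scheduler's enqueue callback: the task enters custody. */
static void toy_enqueue(struct toy_sched *s, struct toy_task *t)
{
	assert(!t->in_custody);	/* a task cannot enter custody twice */
	t->in_custody = true;
	s->queued_runtime += t->runtime;
}

/* Models the scheduler's dequeue callback: undo the enqueue accounting. */
static void toy_ops_dequeue(struct toy_sched *s, struct toy_task *t,
			    uint64_t deq_flags)
{
	s->queued_runtime -= t->runtime;
	if (deq_flags & TOY_DEQ_SCHED_CHANGE)
		s->nr_change_deq++;
	else
		s->nr_dispatch_deq++;
}

/*
 * Models the core side: notify the scheduler only if the task is still in
 * custody, then clear the custody state so that later dequeue events for
 * the same task are not reported again. Returns true if the callback fired.
 */
static bool toy_leave_custody(struct toy_sched *s, struct toy_task *t,
			      uint64_t deq_flags)
{
	if (!t->in_custody)
		return false;
	t->in_custody = false;
	toy_ops_dequeue(s, t, deq_flags);
	return true;
}
```

In this model, calling toy_leave_custody() twice for the same task fires the
callback only the first time, so queued_runtime returns to zero once every
enqueued task has left custody, regardless of whether it left through a
dispatch or through a property change.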
Changes in v9:
- Ignore internal SCX migrations: do not notify BPF schedulers for
internal enqueue/dequeue events
- Use sticky_cpu to determine when a task is undergoing an internal
SCX migration
- Trigger ops.dequeue() consistently from ops_dequeue() or when directly
dispatching to terminal DSQs
- Add preliminary patches to refactor dispatch_enqueue() and properly mark
internal migrations using sticky_cpu
- Link to v8:
https://lore.kernel.org/all/20260210212813.796548-1-arighi@xxxxxxxxxx
Changes in v8:
- Rename SCX_TASK_NEED_DEQ -> SCX_TASK_IN_CUSTODY and set/clear this flag
also when ops.dequeue() is not implemented (can be used for other
purposes in the future)
- Clarify ops.select_cpu() behavior: dispatch to terminal DSQs doesn't
trigger ops.dequeue(), dispatch to user DSQs triggers ops.dequeue(),
store to BPF-internal data structure is discouraged
- Link to v7:
https://lore.kernel.org/all/20260206135742.2339918-1-arighi@xxxxxxxxxx
Changes in v7:
- Handle tasks stored to BPF internal data structures (trigger
ops.dequeue())
- Add a kselftest scenario with a BPF queue to verify ops.dequeue()
behavior with tasks stored in internal BPF data structures
- Link to v6:
https://lore.kernel.org/all/20260205153304.1996142-1-arighi@xxxxxxxxxx
Changes in v6:
- Rename SCX_TASK_OPS_ENQUEUED -> SCX_TASK_NEED_DEQ
- Use SCX_DSQ_FLAG_BUILTIN in is_terminal_dsq() to check for all builtin
DSQs (local, global, bypass)
- Centralize ops.dequeue() logic in dispatch_enqueue()
- Remove "Property Change Notifications for Running Tasks" section from
the documentation
- The kselftest now validates the right behavior both from ops.enqueue()
and ops.select_cpu()
- Link to v5: https://lore.kernel.org/all/20260204160710.1475802-1-arighi@xxxxxxxxxx
Changes in v5:
- Introduce the concept of "terminal DSQ" (when a task is dispatched to a
terminal DSQ, the task leaves the BPF scheduler's custody)
- Consider SCX_DSQ_GLOBAL as a terminal DSQ
- Link to v4: https://lore.kernel.org/all/20260201091318.178710-1-arighi@xxxxxxxxxx
Changes in v4:
- Introduce the concept of "BPF scheduler custody"
- Do not trigger ops.dequeue() for direct dispatches to local DSQs
- Trigger ops.dequeue() only once; after the task leaves BPF scheduler
custody, further dequeue events are not reported.
- Link to v3: https://lore.kernel.org/all/20260126084258.3798129-1-arighi@xxxxxxxxxx
Changes in v3:
- Rename SCX_DEQ_ASYNC to SCX_DEQ_SCHED_CHANGE
- Handle core-sched dequeues (Kuba)
- Link to v2: https://lore.kernel.org/all/20260121123118.964704-1-arighi@xxxxxxxxxx
Changes in v2:
- Distinguish between "dispatch" dequeues and "property change" dequeues
(flag SCX_DEQ_ASYNC)
- Link to v1: https://lore.kernel.org/all/20251219224450.2537941-1-arighi@xxxxxxxxxx
Andrea Righi (4):
  sched_ext: Properly mark SCX-internal migrations via sticky_cpu
  sched_ext: Add rq parameter to dispatch_enqueue()
  sched_ext: Fix ops.dequeue() semantics
  selftests/sched_ext: Add test to validate ops.dequeue() semantics

 Documentation/scheduler/sched-ext.rst           |  78 ++++-
 include/linux/sched/ext.h                       |   1 +
 kernel/sched/ext.c                              | 180 ++++++++++--
 kernel/sched/ext_internal.h                     |   7 +
 tools/sched_ext/include/scx/enum_defs.autogen.h |   1 +
 tools/sched_ext/include/scx/enums.autogen.bpf.h |   2 +
 tools/sched_ext/include/scx/enums.autogen.h     |   1 +
 tools/testing/selftests/sched_ext/Makefile      |   1 +
 tools/testing/selftests/sched_ext/dequeue.bpf.c | 367 ++++++++++++++++++++++++
 tools/testing/selftests/sched_ext/dequeue.c     | 265 +++++++++++++++++
 10 files changed, 874 insertions(+), 29 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/dequeue.c