[PATCH 1/2] sched_ext: Fix ops.dequeue() semantics

From: Andrea Righi

Date: Tue Feb 10 2026 - 16:28:55 EST

Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).

BPF scheduler custody concept: a task is considered to be in the BPF
scheduler's custody when the scheduler is responsible for managing its
lifecycle. This includes tasks dispatched to user-created DSQs or stored
in the BPF scheduler's internal data structures from ops.enqueue().
Custody ends when the task is dispatched to a terminal DSQ (such as the
local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed
due to a property change.

Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are never in its custody. Terminal DSQs include:
- Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
where tasks go directly to execution.
- Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
BPF scheduler is considered "done" with the task.

As a result, ops.dequeue() is not invoked for tasks directly dispatched
to terminal DSQs.
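
As an illustration of the terminal/non-terminal split, the check can be
modelled in plain user-space C as a test on the DSQ ID. This is only a
sketch, not kernel code: the MODEL_* constants are assumptions that mimic
the layout of built-in DSQ IDs (a "builtin" top bit plus a small index),
and model_is_terminal_dsq() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout, for illustration only: the top bit marks a built-in DSQ. */
#define MODEL_DSQ_FLAG_BUILTIN	(1ULL << 63)
#define MODEL_DSQ_FLAG_LOCAL_ON	(1ULL << 62)

#define MODEL_DSQ_INVALID	(MODEL_DSQ_FLAG_BUILTIN | 0)
#define MODEL_DSQ_GLOBAL	(MODEL_DSQ_FLAG_BUILTIN | 1)
#define MODEL_DSQ_LOCAL		(MODEL_DSQ_FLAG_BUILTIN | 2)
#define MODEL_DSQ_LOCAL_ON	(MODEL_DSQ_FLAG_BUILTIN | MODEL_DSQ_FLAG_LOCAL_ON)

/* The invalid ID carries the builtin bit too, so it needs its own check. */
static_assert(MODEL_DSQ_GLOBAL != MODEL_DSQ_INVALID,
	      "global DSQ must differ from the invalid ID");

/* True if the BPF scheduler is "done" with a task inserted into @dsq_id. */
static bool model_is_terminal_dsq(uint64_t dsq_id)
{
	return (dsq_id & MODEL_DSQ_FLAG_BUILTIN) &&
	       dsq_id != MODEL_DSQ_INVALID;
}
```

In this model any ID without the builtin bit (for example a user DSQ
created with ID 42) is non-terminal, so inserting a task there would put
it in the BPF scheduler's custody.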

To identify dequeues triggered by scheduling property changes, introduce
a new ops.dequeue() flag, %SCX_DEQ_SCHED_CHANGE, which is set when the
dequeue was caused by a scheduling property change.

New ops.dequeue() semantics:
- ops.dequeue() is invoked exactly once when the task leaves the BPF
scheduler's custody, in one of the following cases:
a) regular dispatch: a task dispatched to a user DSQ or stored in
internal BPF data structures is moved to a terminal DSQ
(ops.dequeue() called without any special flags set),
b) core scheduling dispatch: core-sched picks the task before it is
dispatched (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC set),
c) property change: the task's properties are modified before it is
dispatched (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE set).
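
The three cases above can be told apart from the flags alone. The
following is a minimal user-space sketch of how a scheduler might
classify the deq_flags it receives. The two flag values are the ones
defined in enum scx_deq_flags by this patch; enum deq_reason and
classify_dequeue() are hypothetical names used only for illustration,
and any other flag bits (such as a sleep dequeue) are lumped into a
catch-all reason.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as defined in enum scx_deq_flags by this patch. */
#define SCX_DEQ_CORE_SCHED_EXEC	(1ULL << 32)
#define SCX_DEQ_SCHED_CHANGE	(1ULL << 33)

static_assert((SCX_DEQ_CORE_SCHED_EXEC & SCX_DEQ_SCHED_CHANGE) == 0,
	      "dequeue flags must not overlap");

/* Hypothetical reason codes, used only by this sketch. */
enum deq_reason {
	DEQ_REASON_DISPATCH,	 /* case a): no special flags set */
	DEQ_REASON_CORE_SCHED,	 /* case b): core-sched pick */
	DEQ_REASON_SCHED_CHANGE, /* case c): property change */
	DEQ_REASON_OTHER,	 /* anything not modelled here */
};

static enum deq_reason classify_dequeue(uint64_t deq_flags)
{
	if (deq_flags & SCX_DEQ_SCHED_CHANGE)
		return DEQ_REASON_SCHED_CHANGE;
	if (deq_flags & SCX_DEQ_CORE_SCHED_EXEC)
		return DEQ_REASON_CORE_SCHED;
	if (!deq_flags)
		return DEQ_REASON_DISPATCH;
	return DEQ_REASON_OTHER;
}
```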

This allows BPF schedulers to:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of managed tasks,
- update internal state when tasks change properties.
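
To make the accounting benefit concrete, here is a small user-space
model of the bookkeeping that the exactly-once guarantee enables. It is
a sketch under stated assumptions, not code from this patch: struct
model_task, struct model_sched and both helpers are hypothetical, and
the in_custody field only plays the role that SCX_TASK_IN_CUSTODY plays
in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task state tracked by the model. */
struct model_task {
	bool in_custody;
};

/* Hypothetical per-scheduler counters. */
struct model_sched {
	long nr_managed;	/* tasks currently in custody */
	long nr_dequeues;	/* dequeue notifications delivered */
};

/* Task queued on a user DSQ or in an internal structure: custody begins. */
static void model_enter_custody(struct model_sched *s, struct model_task *t)
{
	assert(!t->in_custody);
	t->in_custody = true;
	s->nr_managed++;
}

/*
 * Task leaves custody (terminal dispatch, core-sched pick or property
 * change). Returns true if a dequeue notification was delivered.
 */
static bool model_leave_custody(struct model_sched *s, struct model_task *t)
{
	if (!t->in_custody)
		return false;	/* never in custody, or already notified */

	t->in_custody = false;
	s->nr_managed--;
	s->nr_dequeues++;
	return true;
}
```

In this model a task dispatched directly to a terminal DSQ never calls
model_enter_custody(), so a later model_leave_custody() is a no-op, and
a task that did enter custody is counted out exactly once no matter how
many exit paths are attempted afterwards.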

Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Emil Tsalapatis <emil@xxxxxxxxxxxxxxx>
Cc: Kuba Piecuch <jpiecuch@xxxxxxxxxx>
Signed-off-by: Andrea Righi <arighi@xxxxxxxxxx>
---
Documentation/scheduler/sched-ext.rst | 78 ++++++++-
include/linux/sched/ext.h | 1 +
kernel/sched/ext.c | 155 ++++++++++++++++--
kernel/sched/ext_internal.h | 7 +
.../sched_ext/include/scx/enum_defs.autogen.h | 1 +
.../sched_ext/include/scx/enums.autogen.bpf.h | 2 +
tools/sched_ext/include/scx/enums.autogen.h | 1 +
7 files changed, 221 insertions(+), 24 deletions(-)

diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
index 404fe6126a769..21c65e504da7c 100644
--- a/Documentation/scheduler/sched-ext.rst
+++ b/Documentation/scheduler/sched-ext.rst
@@ -229,16 +229,23 @@ The following briefly shows how a waking task is scheduled and executed.
scheduler can wake up any cpu using the ``scx_bpf_kick_cpu()`` helper,
using ``ops.select_cpu()`` judiciously can be simpler and more efficient.

- A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
- by calling ``scx_bpf_dsq_insert()``. If the task is inserted into
- ``SCX_DSQ_LOCAL`` from ``ops.select_cpu()``, it will be inserted into the
- local DSQ of whichever CPU is returned from ``ops.select_cpu()``.
- Additionally, inserting directly from ``ops.select_cpu()`` will cause the
- ``ops.enqueue()`` callback to be skipped.
-
Note that the scheduler core will ignore an invalid CPU selection, for
example, if it's outside the allowed cpumask of the task.

+ A task can be immediately inserted into a DSQ from ``ops.select_cpu()``
+ by calling ``scx_bpf_dsq_insert()`` or ``scx_bpf_dsq_insert_vtime()``.
+
+ If the task is inserted into ``SCX_DSQ_LOCAL`` from
+ ``ops.select_cpu()``, it will be added to the local DSQ of whichever CPU
+ is returned from ``ops.select_cpu()``. Additionally, inserting directly
+ from ``ops.select_cpu()`` will cause the ``ops.enqueue()`` callback to
+ be skipped.
+
+ Any other attempt to store a task in BPF-internal data structures from
+ ``ops.select_cpu()`` does not prevent ``ops.enqueue()`` from being
+ invoked. This is discouraged, as it can introduce racy or inconsistent
+ state.
+
2. Once the target CPU is selected, ``ops.enqueue()`` is invoked (unless the
task was inserted directly from ``ops.select_cpu()``). ``ops.enqueue()``
can make one of the following decisions:
@@ -252,6 +259,61 @@ The following briefly shows how a waking task is scheduled and executed.

* Queue the task on the BPF side.

+ **Task State Tracking and ops.dequeue() Semantics**
+
+ A task is in the "BPF scheduler's custody" when the BPF scheduler is
+ responsible for managing its lifecycle. A task enters custody when it is
+ dispatched to a user DSQ or stored in the BPF scheduler's internal data
+ structures. Custody is entered only from ``ops.enqueue()`` for those
+ operations. The only exception is dispatching to a user DSQ from
+ ``ops.select_cpu()``: although the task is not yet technically in BPF
+ scheduler custody at that point, the dispatch is treated, for custody
+ purposes, exactly like a dispatch to a user DSQ from
+ ``ops.enqueue()``.
+
+ Once ``ops.enqueue()`` is called, the task may or may not enter custody
+ depending on what the scheduler does:
+
+ * **Directly dispatched to terminal DSQs** (``SCX_DSQ_LOCAL``,
+ ``SCX_DSQ_LOCAL_ON | cpu``, or ``SCX_DSQ_GLOBAL``): the BPF scheduler
+ is done with the task - it either goes straight to a CPU's local run
+ queue or to the global DSQ as a fallback. The task never enters (or
+ exits) BPF custody, and ``ops.dequeue()`` will not be called.
+
+ * **Dispatch to user-created DSQs** (custom DSQs): the task enters the
+ BPF scheduler's custody. When the task later leaves BPF custody
+ (dispatched to a terminal DSQ, picked by core-sched, or dequeued for
+ sleep/property changes), ``ops.dequeue()`` will be called exactly
+ once.
+
+ * **Stored in BPF data structures** (e.g., internal BPF queues): the
+ task is in BPF custody. ``ops.dequeue()`` will be called when it
+ leaves (e.g., when ``ops.dispatch()`` moves it to a terminal DSQ, or
+ on property change / sleep).
+
+ When a task leaves BPF scheduler custody, ``ops.dequeue()`` is invoked.
+ The dequeue can happen for different reasons, distinguished by flags:
+
+ 1. **Regular dispatch**: when a task in BPF custody is dispatched to a
+ terminal DSQ from ``ops.dispatch()`` (leaving BPF custody for
+ execution), ``ops.dequeue()`` is triggered without any special flags.
+
+ 2. **Core scheduling pick**: when ``CONFIG_SCHED_CORE`` is enabled and
+ core scheduling picks a task for execution while it's still in BPF
+ custody, ``ops.dequeue()`` is called with the
+ ``SCX_DEQ_CORE_SCHED_EXEC`` flag.
+
+ 3. **Scheduling property change**: when a task property changes (via
+ operations like ``sched_setaffinity()``, ``sched_setscheduler()``,
+ priority changes, CPU migrations, etc.) while the task is still in
+ BPF custody, ``ops.dequeue()`` is called with the
+ ``SCX_DEQ_SCHED_CHANGE`` flag set in ``deq_flags``.
+
+ **Important**: Once a task has left BPF custody (e.g. after being
+ dispatched to a terminal DSQ), property changes will not trigger
+ ``ops.dequeue()``, since the task is no longer managed by the BPF
+ scheduler.
+
3. When a CPU is ready to schedule, it first looks at its local DSQ. If
empty, it then looks at the global DSQ. If there still isn't a task to
run, ``ops.dispatch()`` is invoked which can use the following two
@@ -319,6 +381,8 @@ by a sched_ext scheduler:
/* Any usable CPU becomes available */

ops.dispatch(); /* Task is moved to a local DSQ */
+
+ ops.dequeue(); /* Exiting BPF scheduler */
}
ops.running(); /* Task starts running on its assigned CPU */
while (task->scx.slice > 0 && task is runnable)
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index bcb962d5ee7d8..4601e5ecb43c0 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -84,6 +84,7 @@ struct scx_dispatch_q {
/* scx_entity.flags */
enum scx_ent_flags {
SCX_TASK_QUEUED = 1 << 0, /* on ext runqueue */
+ SCX_TASK_IN_CUSTODY = 1 << 1, /* in custody, needs ops.dequeue() when leaving */
SCX_TASK_RESET_RUNNABLE_AT = 1 << 2, /* runnable_at should be reset */
SCX_TASK_DEQD_FOR_SLEEP = 1 << 3, /* last dequeue was for SLEEP */

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0bb8fa927e9e9..5f7c9088f90a9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -925,6 +925,27 @@ static void touch_core_sched(struct rq *rq, struct task_struct *p)
#endif
}

+/**
+ * is_terminal_dsq - Check if a DSQ is terminal for ops.dequeue() purposes
+ * @dsq_id: DSQ ID to check
+ *
+ * Returns true if @dsq_id is a terminal/builtin DSQ where the BPF
+ * scheduler is considered "done" with the task.
+ *
+ * Builtin DSQs include:
+ * - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
+ * where tasks go directly to execution,
+ * - Global DSQ (%SCX_DSQ_GLOBAL): built-in fallback queue,
+ * - Bypass DSQ: used during bypass mode.
+ *
+ * Tasks dispatched to builtin DSQs exit BPF scheduler custody and do not
+ * trigger ops.dequeue() when they are later consumed.
+ */
+static inline bool is_terminal_dsq(u64 dsq_id)
+{
+ return (dsq_id & SCX_DSQ_FLAG_BUILTIN) && dsq_id != SCX_DSQ_INVALID;
+}
+
/**
* touch_core_sched_dispatch - Update core-sched timestamp on dispatch
* @rq: rq to read clock from, must be locked
@@ -1008,7 +1029,8 @@ static void local_dsq_post_enq(struct scx_dispatch_q *dsq, struct task_struct *p
resched_curr(rq);
}

-static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
+ struct scx_dispatch_q *dsq,
struct task_struct *p, u64 enq_flags)
{
bool is_local = dsq->id == SCX_DSQ_LOCAL;
@@ -1103,6 +1125,23 @@ static void dispatch_enqueue(struct scx_sched *sch, struct scx_dispatch_q *dsq,
dsq_mod_nr(dsq, 1);
p->scx.dsq = dsq;

+ /*
+ * Handle ops.dequeue() and custody tracking.
+ *
+ * Terminal DSQs: the BPF scheduler is done with the task. If it
+ * was in BPF custody, call ops.dequeue() and clear the flag.
+ *
+ * Non-terminal DSQs: task is in BPF scheduler's custody.
+ */
+ if (is_terminal_dsq(dsq->id)) {
+ if (SCX_HAS_OP(sch, dequeue) &&
+ (p->scx.flags & SCX_TASK_IN_CUSTODY))
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, 0);
+ p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+ } else {
+ p->scx.flags |= SCX_TASK_IN_CUSTODY;
+ }
+
/*
* scx.ddsp_dsq_id and scx.ddsp_enq_flags are only relevant on the
* direct dispatch path, but we clear them here because the direct
@@ -1323,7 +1362,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
return;
}

- dispatch_enqueue(sch, dsq, p,
+ dispatch_enqueue(sch, rq, dsq, p,
p->scx.ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
}

@@ -1407,13 +1446,19 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
* dequeue may be waiting. The store_release matches their load_acquire.
*/
atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
+
+ /*
+ * Task is now in BPF scheduler's custody. Set %SCX_TASK_IN_CUSTODY
+ * so ops.dequeue() is called when it leaves custody.
+ */
+ p->scx.flags |= SCX_TASK_IN_CUSTODY;
return;

direct:
direct_dispatch(sch, p, enq_flags);
return;
local_norefill:
- dispatch_enqueue(sch, &rq->scx.local_dsq, p, enq_flags);
+ dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
return;
local:
dsq = &rq->scx.local_dsq;
@@ -1433,7 +1478,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
*/
touch_core_sched(rq, p);
refill_task_slice_dfl(sch, p);
- dispatch_enqueue(sch, dsq, p, enq_flags);
+ dispatch_enqueue(sch, rq, dsq, p, enq_flags);
}

static bool task_runnable(const struct task_struct *p)
@@ -1511,6 +1556,27 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags
__scx_add_event(sch, SCX_EV_SELECT_CPU_FALLBACK, 1);
}

+/*
+ * Call ops.dequeue() for a task leaving BPF custody.
+ */
+static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
+ struct task_struct *p, u64 deq_flags,
+ bool is_sched_change)
+{
+ if (SCX_HAS_OP(sch, dequeue)) {
+ /*
+ * Set %SCX_DEQ_SCHED_CHANGE when the dequeue is due to a
+ * property change (not sleep or core-sched pick).
+ */
+ if (is_sched_change &&
+ !(deq_flags & (DEQUEUE_SLEEP | SCX_DEQ_CORE_SCHED_EXEC)))
+ deq_flags |= SCX_DEQ_SCHED_CHANGE;
+
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq, p, deq_flags);
+ }
+ p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+}
+
static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
{
struct scx_sched *sch = scx_root;
@@ -1524,6 +1590,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)

switch (opss & SCX_OPSS_STATE_MASK) {
case SCX_OPSS_NONE:
+ /*
+ * If the task is still in BPF scheduler's custody
+ * (%SCX_TASK_IN_CUSTODY is set) call ops.dequeue().
+ */
+ if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+ call_task_dequeue(sch, rq, p, deq_flags, true);
break;
case SCX_OPSS_QUEUEING:
/*
@@ -1532,9 +1604,12 @@ static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags)
*/
BUG();
case SCX_OPSS_QUEUED:
- if (SCX_HAS_OP(sch, dequeue))
- SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, rq,
- p, deq_flags);
+ /*
+ * Task is in BPF scheduler's custody (not dispatched yet).
+ * Call ops.dequeue() to notify that it's leaving custody.
+ */
+ WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY));
+ call_task_dequeue(sch, rq, p, deq_flags, true);

if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
SCX_OPSS_NONE))
@@ -1631,6 +1706,7 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,
struct scx_dispatch_q *src_dsq,
struct rq *dst_rq)
{
+ struct scx_sched *sch = scx_root;
struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;

/* @dsq is locked and @p is on @dst_rq */
@@ -1639,6 +1715,16 @@ static void move_local_task_to_local_dsq(struct task_struct *p, u64 enq_flags,

WARN_ON_ONCE(p->scx.holding_cpu >= 0);

+ /*
+ * Task is moving from a non-local DSQ to a local (terminal) DSQ.
+ * Call ops.dequeue() if the task was in BPF custody.
+ */
+ if (p->scx.flags & SCX_TASK_IN_CUSTODY) {
+ if (SCX_HAS_OP(sch, dequeue))
+ SCX_CALL_OP_TASK(sch, SCX_KF_REST, dequeue, dst_rq, p, 0);
+ p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
+ }
+
if (enq_flags & (SCX_ENQ_HEAD | SCX_ENQ_PREEMPT))
list_add(&p->scx.dsq_list.node, &dst_dsq->list);
else
@@ -1801,12 +1887,19 @@ static bool unlink_dsq_and_lock_src_rq(struct task_struct *p,
!WARN_ON_ONCE(src_rq != task_rq(p));
}

-static bool consume_remote_task(struct rq *this_rq, struct task_struct *p,
- struct scx_dispatch_q *dsq, struct rq *src_rq)
+static bool consume_remote_task(struct scx_sched *sch, struct rq *this_rq,
+ struct task_struct *p,
+ struct scx_dispatch_q *dsq, struct rq *src_rq)
{
raw_spin_rq_unlock(this_rq);

if (unlink_dsq_and_lock_src_rq(p, dsq, src_rq)) {
+ /*
+ * Task is moving from a non-local DSQ to a local (terminal) DSQ.
+ * Call ops.dequeue() if the task was in BPF custody.
+ */
+ if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+ call_task_dequeue(sch, src_rq, p, 0, false);
move_remote_task_to_local_dsq(p, 0, src_rq, this_rq);
return true;
} else {
@@ -1867,6 +1960,13 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
src_dsq, dst_rq);
raw_spin_unlock(&src_dsq->lock);
} else {
+ /*
+ * Moving to a local DSQ, dispatch_enqueue() is not
+ * used, so call ops.dequeue() here if the task was
+ * in BPF scheduler's custody.
+ */
+ if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+ call_task_dequeue(sch, src_rq, p, 0, false);
raw_spin_unlock(&src_dsq->lock);
move_remote_task_to_local_dsq(p, enq_flags,
src_rq, dst_rq);
@@ -1879,7 +1979,7 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
dispatch_dequeue_locked(p, src_dsq);
raw_spin_unlock(&src_dsq->lock);

- dispatch_enqueue(sch, dst_dsq, p, enq_flags);
+ dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
}

return dst_rq;
@@ -1922,7 +2022,7 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
}

if (task_can_run_on_remote_rq(sch, p, rq, false)) {
- if (likely(consume_remote_task(rq, p, dsq, task_rq)))
+ if (likely(consume_remote_task(sch, rq, p, dsq, task_rq)))
return true;
goto retry;
}
@@ -1969,14 +2069,14 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
* If dispatching to @rq that @p is already on, no lock dancing needed.
*/
if (rq == src_rq && rq == dst_rq) {
- dispatch_enqueue(sch, dst_dsq, p,
+ dispatch_enqueue(sch, rq, dst_dsq, p,
enq_flags | SCX_ENQ_CLEAR_OPSS);
return;
}

if (src_rq != dst_rq &&
unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
- dispatch_enqueue(sch, find_global_dsq(sch, p), p,
+ dispatch_enqueue(sch, rq, find_global_dsq(sch, p), p,
enq_flags | SCX_ENQ_CLEAR_OPSS);
return;
}
@@ -2014,9 +2114,16 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
*/
if (src_rq == dst_rq) {
p->scx.holding_cpu = -1;
- dispatch_enqueue(sch, &dst_rq->scx.local_dsq, p,
+ dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
enq_flags);
} else {
+ /*
+ * Moving to a local DSQ, dispatch_enqueue() is not
+ * used, so call ops.dequeue() here if the task was
+ * in BPF scheduler's custody.
+ */
+ if (p->scx.flags & SCX_TASK_IN_CUSTODY)
+ call_task_dequeue(sch, src_rq, p, 0, false);
move_remote_task_to_local_dsq(p, enq_flags,
src_rq, dst_rq);
/* task has been moved to dst_rq, which is now locked */
@@ -2113,7 +2220,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
if (dsq->id == SCX_DSQ_LOCAL)
dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
else
- dispatch_enqueue(sch, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+ dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
}

static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
@@ -2414,7 +2521,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
* DSQ.
*/
if (p->scx.slice && !scx_rq_bypassing(rq)) {
- dispatch_enqueue(sch, &rq->scx.local_dsq, p,
+ dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p,
SCX_ENQ_HEAD);
goto switch_class;
}
@@ -2898,6 +3005,13 @@ static void scx_enable_task(struct task_struct *p)

lockdep_assert_rq_held(rq);

+ /*
+ * Verify the task is not in BPF scheduler's custody. If flag
+ * transitions are consistent, the flag should always be clear
+ * here.
+ */
+ WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
+
/*
* Set the weight before calling ops.enable() so that the scheduler
* doesn't see a stale value if they inspect the task struct.
@@ -2929,6 +3043,13 @@ static void scx_disable_task(struct task_struct *p)
if (SCX_HAS_OP(sch, disable))
SCX_CALL_OP_TASK(sch, SCX_KF_REST, disable, rq, p);
scx_set_task_state(p, SCX_TASK_READY);
+
+ /*
+ * Verify the task is not in BPF scheduler's custody. If flag
+ * transitions are consistent, the flag should always be clear
+ * here.
+ */
+ WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
}

static void scx_exit_task(struct task_struct *p)
@@ -3919,7 +4040,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, struct rq *rq,
* between bypass DSQs.
*/
dispatch_dequeue_locked(p, donor_dsq);
- dispatch_enqueue(sch, donee_dsq, p, SCX_ENQ_NESTED);
+ dispatch_enqueue(sch, donee_rq, donee_dsq, p, SCX_ENQ_NESTED);

/*
* $donee might have been idle and need to be woken up. No need
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 386c677e4c9a0..befa9a5d6e53f 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -982,6 +982,13 @@ enum scx_deq_flags {
* it hasn't been dispatched yet. Dequeue from the BPF side.
*/
SCX_DEQ_CORE_SCHED_EXEC = 1LLU << 32,
+
+ /*
+ * The task is being dequeued due to a property change (e.g.,
+ * sched_setaffinity(), sched_setscheduler(), set_user_nice(),
+ * etc.).
+ */
+ SCX_DEQ_SCHED_CHANGE = 1LLU << 33,
};

enum scx_pick_idle_cpu_flags {
diff --git a/tools/sched_ext/include/scx/enum_defs.autogen.h b/tools/sched_ext/include/scx/enum_defs.autogen.h
index c2c33df9292c2..dcc945304760f 100644
--- a/tools/sched_ext/include/scx/enum_defs.autogen.h
+++ b/tools/sched_ext/include/scx/enum_defs.autogen.h
@@ -21,6 +21,7 @@
#define HAVE_SCX_CPU_PREEMPT_UNKNOWN
#define HAVE_SCX_DEQ_SLEEP
#define HAVE_SCX_DEQ_CORE_SCHED_EXEC
+#define HAVE_SCX_DEQ_SCHED_CHANGE
#define HAVE_SCX_DSQ_FLAG_BUILTIN
#define HAVE_SCX_DSQ_FLAG_LOCAL_ON
#define HAVE_SCX_DSQ_INVALID
diff --git a/tools/sched_ext/include/scx/enums.autogen.bpf.h b/tools/sched_ext/include/scx/enums.autogen.bpf.h
index 2f8002bcc19ad..5da50f9376844 100644
--- a/tools/sched_ext/include/scx/enums.autogen.bpf.h
+++ b/tools/sched_ext/include/scx/enums.autogen.bpf.h
@@ -127,3 +127,5 @@ const volatile u64 __SCX_ENQ_CLEAR_OPSS __weak;
const volatile u64 __SCX_ENQ_DSQ_PRIQ __weak;
#define SCX_ENQ_DSQ_PRIQ __SCX_ENQ_DSQ_PRIQ

+const volatile u64 __SCX_DEQ_SCHED_CHANGE __weak;
+#define SCX_DEQ_SCHED_CHANGE __SCX_DEQ_SCHED_CHANGE
diff --git a/tools/sched_ext/include/scx/enums.autogen.h b/tools/sched_ext/include/scx/enums.autogen.h
index fedec938584be..fc9a7a4d9dea5 100644
--- a/tools/sched_ext/include/scx/enums.autogen.h
+++ b/tools/sched_ext/include/scx/enums.autogen.h
@@ -46,4 +46,5 @@
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_LAST); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_CLEAR_OPSS); \
SCX_ENUM_SET(skel, scx_enq_flags, SCX_ENQ_DSQ_PRIQ); \
+ SCX_ENUM_SET(skel, scx_deq_flags, SCX_DEQ_SCHED_CHANGE); \
} while (0)
--
2.53.0