Re: [PATCH v2] sched_ext: Rebuild fair weight on ext to fair switches

From: Peter Zijlstra

Date: Wed May 27 2026 - 07:26:48 EST


On Wed, May 27, 2026 at 05:40:37PM +0800, quzicheng315@xxxxxxxxx wrote:
> From: Zicheng Qu <quzicheng315@xxxxxxxxx>
>
> Tasks running on sched_ext do not use p->se.load as their active
> scheduling weight. Their nice-derived weight is maintained as
> p->scx.weight instead.
>
> When such a task switches back to fair, CFS expects p->se.load to match
> the task's current policy/static_prio before the task is enqueued.
> However, not all ext to fair transitions rebuild p->se.load. For
> example, scx_root_disable() switches tasks back to fair directly, and
> partial mode can move a task from SCHED_EXT to SCHED_NORMAL through
> sched_setscheduler(). In the latter case, set_load_weight(p, true) runs
> while p->sched_class is still ext_sched_class, so reweight_task_scx()
> updates p->scx.weight but leaves p->se.load stale.
>
> Rebuild the fair load weight in sched_change_end() when the class switch
> is from ext_sched_class to fair_sched_class. This is after the class has
> been changed and before the task is enqueued on fair, so CFS sees a
> native load_weight derived from the task's current policy/static_prio.
>
> Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
> Signed-off-by: Zicheng Qu <quzicheng@xxxxxxxxxx>
> ---
> Changes in v2:
> - Move the fix from scx_root_disable() to sched_change_end() so the same
>   ext-to-fair rebuild also covers partial mode SCHED_EXT to SCHED_NORMAL
>   transitions through sched_setscheduler(), as Andrea pointed out.
>
> kernel/sched/core.c | 2 ++
> kernel/sched/ext.h | 11 +++++++++++
> 2 files changed, 13 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b8871449d3c6..c694aabc451a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11200,6 +11200,8 @@ void sched_change_end(struct sched_change_ctx *ctx)
>  	 */
>  	WARN_ON_ONCE(p->sched_class != ctx->class && !(ctx->flags & ENQUEUE_CLASS));
>
> +	scx_rebuild_fair_weight_on_class_switch(p, ctx->class, p->sched_class);
> +
>  	if ((ctx->flags & ENQUEUE_CLASS) && p->sched_class->switching_to)
>  		p->sched_class->switching_to(rq, p);
>
> diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
> index 0b7fc46aee08..1f8248c897af 100644
> --- a/kernel/sched/ext.h
> +++ b/kernel/sched/ext.h
> @@ -35,6 +35,14 @@ static inline bool task_on_scx(const struct task_struct *p)
>  	return scx_enabled() && p->sched_class == &ext_sched_class;
>  }
>
> +static inline void scx_rebuild_fair_weight_on_class_switch(struct task_struct *p,
> +							    const struct sched_class *old_class,
> +							    const struct sched_class *new_class)
> +{
> +	if (old_class == &ext_sched_class && new_class == &fair_sched_class)
> +		set_load_weight(p, false);
> +}
> +
> #ifdef CONFIG_SCHED_CORE
> bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
> bool in_fi);
> @@ -55,6 +63,9 @@ static inline int scx_check_setscheduler(struct task_struct *p, int policy) { re
> static inline bool task_on_scx(const struct task_struct *p) { return false; }
> static inline bool scx_allow_ttwu_queue(const struct task_struct *p) { return true; }
> static inline void init_sched_ext_class(void) {}
> +static inline void scx_rebuild_fair_weight_on_class_switch(struct task_struct *p,
> + const struct sched_class *old_class,
> + const struct sched_class *new_class) {}
>
> #endif /* CONFIG_SCHED_CLASS_EXT */

This is truly horrible. We have 4 class methods involved with switching
classes, and you stick a random call in a place that also runs when no
class is changed.

Would not something like this work?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62a2dcb0d03e..a2eb43bd73b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -14957,6 +14957,11 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 	detach_task_cfs_rq(p);
 }

+static void switching_to_fair(struct rq *rq, struct task_struct *p)
+{
+	set_load_weight(p, false);
+}
+
 static void switched_to_fair(struct rq *rq, struct task_struct *p)
 {
 	WARN_ON_ONCE(p->se.sched_delayed);
@@ -15351,6 +15356,7 @@ DEFINE_SCHED_CLASS(fair) = {
 	.prio_changed = prio_changed_fair,
 	.switching_from = switching_from_fair,
 	.switched_from = switched_from_fair,
+	.switching_to = switching_to_fair,
 	.switched_to = switched_to_fair,

 	.get_rr_interval = get_rr_interval_fair,
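
For anyone following along, here is a rough userspace model of why a
switching_to hook on the destination class would cover the stale weight.
It is only a sketch, not kernel code: struct class_model, struct
task_model, toy_weight() and the *_model helpers are all made-up names,
the weight mapping is a toy and not the kernel's nice-to-weight table,
and the only behaviour it assumes is what the changelog above describes
(a reweight done while the task is on ext updates the ext weight and
leaves the fair weight alone).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_model;

/* Hypothetical stand-in for struct sched_class; only the two hooks used here. */
struct class_model {
	void (*switching_to)(struct task_model *p);
	void (*reweight_task)(struct task_model *p, unsigned long weight);
};

struct task_model {
	const struct class_model *class;
	int nice;
	unsigned long fair_weight;	/* models p->se.load */
	unsigned long ext_weight;	/* models p->scx.weight */
};

/* Toy nice-to-weight mapping for illustration, NOT the kernel's table. */
static unsigned long toy_weight(int nice)
{
	return (unsigned long)(1024 - 40 * nice);
}

/* Models set_load_weight(): defer to the current class only if update_load. */
static void set_load_weight_model(struct task_model *p, bool update_load)
{
	unsigned long w = toy_weight(p->nice);

	if (update_load && p->class->reweight_task)
		p->class->reweight_task(p, w);
	else
		p->fair_weight = w;
}

static void reweight_fair(struct task_model *p, unsigned long w) { p->fair_weight = w; }
static void reweight_ext(struct task_model *p, unsigned long w) { p->ext_weight = w; }

/* Models the suggested hook: rebuild the fair weight before enqueue. */
static void switching_to_fair_model(struct task_model *p)
{
	set_load_weight_model(p, false);
}

static const struct class_model ext_class = { .reweight_task = reweight_ext };
static const struct class_model fair_no_hook = { .reweight_task = reweight_fair };
static const struct class_model fair_with_hook = {
	.switching_to = switching_to_fair_model,
	.reweight_task = reweight_fair,
};

/* Models the class change: install the new class, then call its hook. */
static void switch_class(struct task_model *p, const struct class_model *next)
{
	p->class = next;
	if (next->switching_to)
		next->switching_to(p);
}

/* Returns the fair weight the task carries after an ext -> fair switch. */
static unsigned long fair_weight_after_switch(const struct class_model *fair)
{
	struct task_model p = { .class = fair, .nice = 0 };

	set_load_weight_model(&p, false);	/* fair weight for nice 0 */
	switch_class(&p, &ext_class);
	p.nice = 10;
	set_load_weight_model(&p, true);	/* only the ext weight is updated */
	assert(p.ext_weight == toy_weight(10));
	switch_class(&p, fair);
	return p.fair_weight;
}
```

In this model, fair_weight_after_switch(&fair_no_hook) still returns the
nice 0 weight after the task was reniced to 10 on ext, while
fair_weight_after_switch(&fair_with_hook) returns the nice 10 weight,
because the hook runs only when the class actually changes to fair.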