Re: [PATCH v2] sched/rt: Push RT tasks when preempted by a deadline task

From: K Prateek Nayak

Date: Fri Jun 26 2026 - 01:52:59 EST

Hello Yangsonghua,

On 6/26/2026 8:53 AM, yangsonghua wrote:
> [You don't often get email from jluyangsonghua@xxxxxxxxx. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>
> Commit 704069649b5b ("sched/core: Rework sched_class::wakeup_preempt()
> and rq_modified_*()") made wakeup_preempt_rt() callable for cross-class
> wakeups, leaving a "XXX If we're preempted by DL, queue a push?" comment
> with no implementation.
>
> When a SCHED_DEADLINE task preempts a running SCHED_FIFO/SCHED_RR task,
> the RT class's put_prev_task_rt() is the natural place to trigger a push:
> at that point the preempted task is (conditionally) added to
> pushable_tasks, and we have 'next' available to identify the reason for
> preemption.
>
> The earlier wakeup_preempt_rt() call site is too early: the running RT
> task has not yet been added to pushable_tasks, so rt_queue_push_tasks()
> would be a no-op for the common single-RT-task-per-CPU case.
>
> Guard the dl_task(next) check with a NULL test. put_prev_task_rt() is
> reachable with next == NULL when called via the bare put_prev_task()
> wrapper (sched.h), which always passes NULL:
>
> sched_change_begin()
> put_prev_task(rq, p) /* next = NULL */
> put_prev_task_rt(rq, p, NULL)
> dl_task(NULL) -> NULL deref /* BUG in v1 */
>
> The actual DL-preemption path goes through put_prev_set_next_task(),
> which supplies the real next task, so the NULL guard does not change
> the intended trigger condition.
>
> Regarding potentially unnecessary push/pull cycles for short-lived DL
> tasks: this is an inherent property of any asynchronous push mechanism.
> rt_queue_push_tasks() enqueues a balance callback rather than migrating
> synchronously, so by the time it fires the DL task may have already
> yielded; in that case the RT task may need to be pulled back. This
> trade-off is acceptable: leaving RT tasks stranded on a busy CPU
> while peer CPUs sit idle is worse than an occasional redundant
> migration. The same concern applies to the existing triggers of
> rt_queue_push_tasks() (e.g. after dequeue_task_rt), so this path is
> consistent with the established policy.
>
> Fix this by adding a dl_task(next) check in put_prev_task_rt(), guarded
> against NULL. This is intentionally placed outside and after the
> enqueue_pushable_task() condition, so that a push is triggered even when
> the preempted RT task is pinned (nr_cpus_allowed == 1) or blocked
> (proxy-exec): in those cases there may be other migratable RT tasks
> already sitting in pushable_tasks.

That cannot be the case. More details below.

>
> The task_is_blocked() early-return is folded into the
> enqueue_pushable_task() condition to avoid inadvertently suppressing the
> new push logic.
>
> Note: this does not cover the DL-server case where a fair-class task
> running under DL bandwidth budget displaces an RT task; that is a
> separate concern.
>
> Signed-off-by: yangsonghua <yangsonghua@xxxxxxxxxxx>
> ---
> Changes in v2:
> - Guard dl_task(next) with a 'next != NULL' check to fix a NULL pointer
> dereference when put_prev_task_rt() is called from sched_change_begin()
> via the put_prev_task() wrapper, which always passes next = NULL.
> - Extend the commit message to explain the NULL call path and address the
> reviewer concern about unnecessary push/pull cycles for short-lived DL
> tasks.
> - Improve wakeup_preempt_rt() comment to explicitly point readers to
> put_prev_task_rt() where the DL push is now handled.
>
> kernel/sched/rt.c | 24 +++++++++++++++++++-----
> 1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index e474c31d8fe6..ef8cdc4c25d2 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1627,7 +1627,10 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
> struct task_struct *donor = rq->donor;
>
> /*
> - * XXX If we're preempted by DL, queue a push?
> + * If the incoming task belongs to a higher-priority class (e.g. DL),
> + * we only handle same-class RT preemption here. The DL case is
> + * handled in put_prev_task_rt() once the current task has been
> + * added to pushable_tasks.
> */
> if (p->sched_class != &rt_sched_class)
> return;
> @@ -1736,14 +1739,25 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
>
> update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
>
> - if (task_is_blocked(p))
> - return;

I still don't get why we are removing this early return. When "p" was
selected to run, it would have already queued a balance callback via
set_nex_task_rt() and the balance would have been done during last
__schedule().

If there are pushable tasks on the list still, they'll still remain
non-pushable since they haven't run on CPU and nothing has changed for
them to suddenly enable a movement.

"p" itself is a blocked task so it cannot be moved - it'll simply
follow the lock owner via proxy migration path whenever it gets a
chance.

So the gist is: If there are pushable tasks, then balance failed to
move them previously. If the state of the system had changed,
balance_rt() on a remote CPU would have already pulled the highest
priority task from this rq.

I feel that early return for blocked task should stay as is since
there is no opportunity for balance yet unless I'm missing something.

> /*
> * The previous task needs to be made eligible for pushing
> - * if it is still active
> + * if it is still active and migratable.
> */
> - if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
> + if (!task_is_blocked(p) && on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1)
> enqueue_pushable_task(rq, p);
> +
> + /*
> + * When a deadline task takes over this CPU, try to push any queued
> + * RT tasks to CPUs running lower-priority work. This is independent
> + * of whether p itself is pushable: even if p is pinned or blocked,
> + * there may be other migratable RT tasks already in pushable_tasks.
> + *
> + * next is NULL when called from put_prev_task() (e.g. sched_change_begin);
> + * guard accordingly. rt_queue_push_tasks() checks has_pushable_tasks()
> + * internally, so this is a no-op if nothing is queued.
> + */
> + if (next && dl_task(next))
> + rt_queue_push_tasks(rq);

Based on some quick look at the cpupri bits, I think this makes sense.

Since we try to move "p" to a CPU running a lower priority task and a
deadline task would have increased the priority of this CPU anyways, it
is fair to attempt a push.

I'll let Steve finish his metamorphosis and see if that estimate is
wrong or not ;-)

> }
>
> /* Only try algorithms three times */
> --
> 2.34.1
>

--
Thanks and Regards,
Prateek