Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)

From: John Stultz

Date: Wed Mar 18 2026 - 15:09:09 EST


On Sun, Mar 15, 2026 at 10:38 AM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
> On 3/13/2026 8:00 AM, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index af497b8c72dce..fe20204cf51cc 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
> > rq->idle_stamp = 0;
> > }
> >
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> > +{
> > + unsigned int wake_cpu;
> > +
> > + /*
> > + * Since we are enqueuing a blocked task on a cpu it may
> > + * not be able to run on, preserve wake_cpu when we
> > + * __set_task_cpu so we can return the task to where it
> > + * was previously runnable.
> > + */
> > + wake_cpu = p->wake_cpu;
> > + __set_task_cpu(p, cpu);
> > + p->wake_cpu = wake_cpu;
> > +}
> > +#endif /* CONFIG_SCHED_PROXY_EXEC */
> > +
> > static void
> > ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > struct rq_flags *rf)
> > @@ -4242,13 +4259,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > ttwu_queue(p, cpu, wake_flags);
> > }
> > out:
> > - /*
> > - * For now, if we've been woken up, clear the task->blocked_on
> > - * regardless if it was set to a mutex or PROXY_WAKING so the
> > - * task can run. We will need to be more careful later when
> > - * properly handling proxy migration
> > - */
> > - clear_task_blocked_on(p, NULL);
>
> So, for this bit, there are mutex variants that are interruptible
> and killable which probably benefit from clearing the blocked_on
> relation.

This is a good point! I need to re-review some of this with that in mind.

> For potential proxy tasks that are still queued, we'll hit the
> ttwu_runnable() path and resched out of there, so it makes sense to
> mark them as PROXY_WAKING so schedule() can return-migrate them; they
> then run, hit the signal_pending_state() check in the
> __mutex_lock_common() loop, and return -EINTR.
>
> Otherwise, if they need a full wakeup, they may be blocked on a
> sleeping owner, in which case it is beneficial to clear blocked_on,
> do a full wakeup, and let them run to evaluate the pending signal.
>
> ttwu_state_match() should filter out any spurious signals. Thoughts?

So, I don't think we can keep the clear_task_blocked_on(p, NULL) in
the out: path there, as then any wakeup would allow the task to run on
that runqueue, even if the task is not affined to that cpu.

But if we did go through the select_task_rq() logic, then clearing the
blocked_on bit should be safe. However, if blocked_on is set, the task
is likely still on the rq, so most cases will shortcut at
ttwu_runnable() and we probably wouldn't get there.

So maybe, if I understand your suggestion, we should call
clear_task_blocked_on() when we go through select_task_rq(), and
otherwise, in the early-exit path, set any blocked_on value to
PROXY_WAKING?

I guess this could also move the set_task_blocked_on_waking into ttwu
instead of the lock waker logic. I'll play with that.

> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > + struct task_struct *p)
> > +{
> > + struct rq *this_rq, *target_rq;
> > + struct rq_flags this_rf;
> > + int cpu, wake_flag = WF_TTWU;
> > +
> > + lockdep_assert_rq_held(rq);
> > + WARN_ON(p == rq->curr);
> > +
> > + /*
> > + * We have to zap callbacks before unlocking the rq
> > + * as another CPU may jump in and call sched_balance_rq
> > + * which can trip the warning in rq_pin_lock() if we
> > + * leave callbacks set.
> > + */
> > + zap_balance_callbacks(rq);
> > + rq_unpin_lock(rq, rf);
> > + raw_spin_rq_unlock(rq);
> > +
> > + /*
> > + * We drop the rq lock, and re-grab task_rq_lock to get
> > + * the pi_lock (needed for select_task_rq) as well.
> > + */
> > + this_rq = task_rq_lock(p, &this_rf);
> > +
> > + /*
> > + * Since we let go of the rq lock, the task may have been
> > + * woken or migrated to another rq before we got the
> > + * task_rq_lock. So re-check we're on the same RQ. If
> > + * not, the task has already been migrated and that CPU
> > + * will handle any further migrations.
> > + */
> > + if (this_rq != rq)
> > + goto err_out;
> > +
> > + /* Similarly, if we've been dequeued, someone else will wake us */
> > + if (!task_on_rq_queued(p))
> > + goto err_out;
> > +
> > + /*
> > + * Since we should only be calling here from __schedule()
> > + * -> find_proxy_task(), no one else should have
> > + * assigned current out from under us. But check and warn
> > + * if we see this, then bail.
> > + */
> > + if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> > + WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
> > + __func__, cpu_of(this_rq),
> > + p->comm, p->pid, p->on_cpu);
> > + goto err_out;
> > }
> > - return NULL;
> > +
> > + update_rq_clock(this_rq);
> > + proxy_resched_idle(this_rq);
>
> I still think this is too late, and only required if we are moving the
> donor. Can we do this before we drop the rq_lock so that a remote
> wakeup doesn't need to clear this? (although I think we don't have

Sorry, I'm not sure I'm following this bit. Are you suggesting the
update_rq_clock goes above the error handling? Or are you suggesting I
move proxy_resched_idle() elsewhere?

> that bit in the ttwu path anymore and we rely on the schedule() bits
> completely for return migration on this version - any particular
> reason?).

Yes, Peter wanted the return-migration via ttwu to be in a separate patch:
https://lore.kernel.org/lkml/20251009114302.GI3245006@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/


>
> > + deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
> > + cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> > + set_task_cpu(p, cpu);
> > + target_rq = cpu_rq(cpu);
> > + clear_task_blocked_on(p, NULL);
> > + task_rq_unlock(this_rq, p, &this_rf);
> > +
> > + attach_one_task(target_rq, p);
>
> I'm still having a hard time believing we cannot use wake_up_process()
> but let me look more into that tomorrow when the sun rises.

I'm curious to hear if you had much luck on this. I've tinkered a bit
today, but keep on hitting the same issue:

<<<Task A>>>
__mutex_unlock_slowpath(lock);
set_task_blocked_on_waking(task_B, lock);
wake_up_process(task_B); /* via wake_up_q() */
try_to_wake_up(task_B, TASK_NORMAL, 0);
ttwu_runnable(task_B, WF_TTWU); /*donor is on_rq, so we trip into this */
ttwu_do_wakeup(task_B);
WRITE_ONCE(p->__state, TASK_RUNNING);
preempt_schedule_irq()
__schedule()
next = pick_next_task(); /* returns task_B (still PROXY_WAKING) */
find_proxy_task(rq, task_B, &rf)
proxy_force_return(rq, rf, task_B);

At this point conceptually we want to dequeue task_B from the
runqueue, and call wake_up_process() so it will be return-migrated to
a runqueue it can run on.

However, the task state is already TASK_RUNNING now, so calling
wake_up_process() again will just shortcut out at ttwu_state_match().
Transitioning to TASK_INTERRUPTIBLE or something else before calling
wake_up_process() seems risky to me (but let me know if I'm wrong here).
So to me, doing the manual deactivate/select_task_rq/attach_one_task
work in proxy_force_return() seems the most straightforward, even
though it is a little duplicative of the ttwu logic.

I think when I had something similar before, it was leaning on
modifications to ttwu(), which this patch avoids at Peter's request.
Though maybe this logic can be simplified with the later optimization
patch to do return migration in the ttwu path?

thanks
-john