Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution

From: John Stultz

Date: Thu Apr 02 2026 - 14:37:56 EST


On Thu, Apr 2, 2026 at 8:51 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Mar 27, 2026 at 10:27:08PM +0530, K Prateek Nayak wrote:
>
> > So taking a step back, this is what we have today (at least the
> > common scenario):
> >
> > CPU0 (donor - A) CPU1 (owner - B)
> > ================ ================
> >
> > mutex_lock()
> > __set_current_state(TASK_INTERRUPTIBLE)
> > __set_task_blocked_on(M)
> > schedule()
> > /* Retained for proxy */
> > proxy_migrate_task()
> > ==================================> /* Migrates to CPU1 */
> > ...
> > send_sig(B)
> > signal_wake_up_state()
> > wake_up_state()
> > try_to_wake_up()
> > ttwu_runnable()
> > ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */
> >
> > /*
> > * After this point ttwu_state_match()
> > * will fail for A so a mutex_unlock()
> > * will have to go through __schedule()
> > * for return migration.
> > */
> >
> > __schedule()
> > find_proxy_task()
> >
> > /* Scenario 1 - B sleeps */
> > __clear_task_blocked_on()
> > proxy_deactivate(A)
> > /* A->__state == TASK_RUNNING */
> > /* fallthrough */
> >
> > /* Scenario 2 - return migration after unlock() */
> > __clear_task_blocked_on()
> > /*
> > * At this point proxy stops.
> > * Much later after signal.
> > */
> > proxy_force_return()
> > schedule() <==================================
> > signal_pending_state()
> >
> > clear_task_blocked_on()
> > __set_current_state(TASK_RUNNING)
> >
> > ... /* return with -EINTR */
> >
> >
> > Basically, a blocked donor has to wait for a mutex_unlock() before it
> > can go process the signal and bail out on the mutex_lock_interruptible()
> > which seems counterproductive - but it is still okay from a correctness
> > perspective.
> >
> > >
> > > One thing you *can* do it frob ttwu_runnable() to 'refuse' to wake the
> > > task, and then it goes into the normal path and will do the migration.
> > > I've done things like that before.
> > >
> > > Does that fix all the return-migration cases?
> >
> > Yes it does! If we handle the return via ttwu_runnable(), which is what
> > proxy_needs_return() in the next chunk of changes aims to do, we can
> > build the invariant that TASK_RUNNING + task_is_blocked() is an illegal
> > state outside of __schedule(), which works well with ttwu_state_match().
> >
> > >
> > >> 2. Why does proxy_needs_return() (this comes later in John's tree but I
> > >> moved it up ahead) need the proxy_task_runnable_but_waking() override
> > >> of the ttwu_state_match() machinery?
> > >> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)
> > >
> > > Since it comes later, I've not seen it and not given it thought ;-)
> > >
> > > (I mean, I've probably seen it at some point, but being the gold-fish
> > > that I am, I have no recollection, so I might as well not have seen it).
> > >
> > > A brief look now makes me confused. The comment fails to describe how
> > > that situation could ever come to pass.
> >
> > That is, a signal delivery happening before the unlock will force
> > TASK_RUNNING, but since we are waiting on an unlock, the wakeup from
> > the unlock will see TASK_RUNNING + PROXY_WAKING.
> >
> > We then later force it on the ttwu path to do return via
> > ttwu_runnable().
>
> So, I've not gone through all the cases yet, and it is *COMPLETELY*
> untested, but something like the below perhaps?
>

So, just to clarify, this suggestion is meant as an alternative to my
return-migration-via-ttwu logic (not included in the v26
simple-donor-migration chunk you've queued)?
https://github.com/johnstultz-work/linux-dev/commit/dfaa472f2a0b2f6a0c73083aaba5c55e256fdb56

I'm working on prepping that next chunk of the series (which includes
that logic) to send out here shortly (integrating the few bits from
Prateek that I've managed to get my head around).

There's still a bunch of other changes tied into waking a rq->donor
outside of __schedule() in that chunk, so I'm not sure if this
discussion would be easier to have in context once those are on the
list?

So some of my (certainly confused) thoughts below...

> ---
> include/linux/sched.h | 2
> kernel/sched/core.c | 173 ++++++++++++++++----------------------------------
> 2 files changed, 58 insertions(+), 117 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -161,7 +161,7 @@ struct user_event_mm;
> */
> #define is_special_task_state(state) \
> ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
> - TASK_DEAD | TASK_FROZEN))
> + TASK_DEAD | TASK_WAKING | TASK_FROZEN))
>
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> # define debug_normal_state_change(state_value) \
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
...
> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> + ACQUIRE(__task_rq_lock, guard)(p);
> + struct rq *rq = guard.rq;
>
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + if (sched_proxy_exec() && p->blocked_on) {
> + guard(raw_spinlock)(&p->blocked_lock);
> + struct mutex *lock = p->blocked_on;
> + if (lock) {
> /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> + * TASK_WAKING is a special state and results in
> + * DEQUEUE_SPECIAL such that the task will actually be
> + * forced from the runqueue.
> */
> - wakeup_preempt(rq, p, wake_flags);
> + block_task(rq, p, TASK_WAKING);
> + p->blocked_on = NULL;
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> }
> - __task_rq_unlock(rq, p, &rf);
>
> - return ret;
> + update_rq_clock(rq);
> + if (p->se.sched_delayed)
> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);

I can't precisely remember the details now, but I believe we need to
handle enqueueing sched_delayed tasks before handling blocked_on
tasks.


> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> + }
> + ttwu_do_wakeup(p);
> + return 1;
> }
>
> void sched_ttwu_pending(void *arg)
...
> @@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
> return rq->idle;
> }
>
> -static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
> {
> unsigned long state = READ_ONCE(donor->__state);
>
> - /* Don't deactivate if the state has been changed to TASK_RUNNING */
> - if (state == TASK_RUNNING)
> - return false;
> + WARN_ON_ONCE(state == TASK_RUNNING);
> +
> /*
> * Because we got donor from pick_next_task(), it is *crucial*
> * that we call proxy_resched_idle() before we deactivate it.
> @@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
> * need to be changed from next *before* we deactivate.
> */
> proxy_resched_idle(rq);
> - return try_to_block_task(rq, donor, &state, true);
> + block_task(rq, donor, state);
> }
>
> static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
> @@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
> proxy_reacquire_rq_lock(rq, rf);
> }
>
> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> - struct task_struct *p)
> - __must_hold(__rq_lockp(rq))
> -{
> - struct rq *task_rq, *target_rq = NULL;
> - int cpu, wake_flag = WF_TTWU;
> -
> - lockdep_assert_rq_held(rq);
> - WARN_ON(p == rq->curr);
> -
> - if (p == rq->donor)
> - proxy_resched_idle(rq);
> -
> - proxy_release_rq_lock(rq, rf);
> - /*
> - * We drop the rq lock, and re-grab task_rq_lock to get
> - * the pi_lock (needed for select_task_rq) as well.
> - */
> - scoped_guard (task_rq_lock, p) {
> - task_rq = scope.rq;
> -
> - /*
> - * Since we let go of the rq lock, the task may have been
> - * woken or migrated to another rq before we got the
> - * task_rq_lock. So re-check we're on the same RQ. If
> - * not, the task has already been migrated and that CPU
> - * will handle any futher migrations.
> - */
> - if (task_rq != rq)
> - break;
> -
> - /*
> - * Similarly, if we've been dequeued, someone else will
> - * wake us
> - */
> - if (!task_on_rq_queued(p))
> - break;
> -
> - /*
> - * Since we should only be calling here from __schedule()
> - * -> find_proxy_task(), no one else should have
> - * assigned current out from under us. But check and warn
> - * if we see this, then bail.
> - */
> - if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
> - WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
> - __func__, cpu_of(task_rq),
> - p->comm, p->pid, p->on_cpu);
> - break;
> - }
> -
> - update_rq_clock(task_rq);
> - deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
> - cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> - set_task_cpu(p, cpu);
> - target_rq = cpu_rq(cpu);
> - clear_task_blocked_on(p, NULL);
> - }
> -
> - if (target_rq)
> - attach_one_task(target_rq, p);
> -
> - proxy_reacquire_rq_lock(rq, rf);
> -}
> -
> /*
> * Find runnable lock owner to proxy for mutex blocked donor
> *
> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
> clear_task_blocked_on(p, PROXY_WAKING);
> return p;
> }
> - goto force_return;
> + goto deactivate;
> }
>
> /*
> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
> __clear_task_blocked_on(p, NULL);
> return p;
> }
> - goto force_return;
> + goto deactivate;
> }
>
> if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
> @@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
> return owner;
>
> deactivate:
> - if (proxy_deactivate(rq, donor))
> - return NULL;
> - /* If deactivate fails, force return */
> - p = donor;
> -force_return:
> - proxy_force_return(rq, rf, p);
> + proxy_deactivate(rq, donor);
> return NULL;
> migrate_task:
> proxy_migrate_task(rq, rf, p, owner_cpu);

So I like getting rid of proxy_force_return(), but it's not clear to me
that proxy_deactivate() is what we want to do in these
find_proxy_task() edge cases.

It feels like, if we are already racing with ttwu, deactivating the
task might open more windows where we lose the wakeup.

In fact, the whole reason we have proxy_force_return() is that earlier
in proxy-exec development, when we hit those edge cases we would
usually just call proxy_resched_idle() to drop the rq lock and let
ttwu do its thing, but there kept being cases where we would end up
with lost wakeups.

But I'll give this a shot (and will integrate your ttwu_runnable
cleanups regardless) and see how it does.

thanks
-john