Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution

From: K Prateek Nayak

Date: Fri Apr 03 2026 - 02:11:40 EST


Hello Peter, John,

On 4/3/2026 12:01 AM, John Stultz wrote:
>> So, I've not gone through all the cases yet, and it is *COMPLETELY*
>> untested, but something like the below perhaps?
>>
>
> So, just to clarify, this suggestion is as an alternative to my
> return-migration via ttwu logic (not included in the v26
> simple-donor-migration chunk you've queued)?
> https://github.com/johnstultz-work/linux-dev/commit/dfaa472f2a0b2f6a0c73083aaba5c55e256fdb56
>
> I'm working on prepping that next chunk of the series (which includes
> that logic) to send out here shortly (integrating the few bits from
> Prateek that I've managed to get my head around).

Thanks and I agree that it might be more digestible that way.

>
> There's still a bunch of other changes tied into waking a rq->donor
> outside of __schedule() in that chunk, so I'm not sure if this
> discussion would be easier to have in context once those are on the
> list?
>
> So some of my (certainly confused) thoughts below...
>
>> ---
>> include/linux/sched.h | 2
>> kernel/sched/core.c | 173 ++++++++++++++++----------------------------------
>> 2 files changed, 58 insertions(+), 117 deletions(-)
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -161,7 +161,7 @@ struct user_event_mm;
>> */
>> #define is_special_task_state(state) \
>> ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
>> - TASK_DEAD | TASK_FROZEN))
>> + TASK_DEAD | TASK_WAKING | TASK_FROZEN))
>>
>> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>> # define debug_normal_state_change(state_value) \
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
> ...
>> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
>> */
>> static int ttwu_runnable(struct task_struct *p, int wake_flags)
>> {
>> - struct rq_flags rf;
>> - struct rq *rq;
>> - int ret = 0;
>> + ACQUIRE(__task_rq_lock, guard)(p);
>> + struct rq *rq = guard.rq;
>>
>> - rq = __task_rq_lock(p, &rf);
>> - if (task_on_rq_queued(p)) {
>> - update_rq_clock(rq);
>> - if (p->se.sched_delayed)
>> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>> - if (!task_on_cpu(rq, p)) {
>> + if (!task_on_rq_queued(p))
>> + return 0;
>> +
>> + if (sched_proxy_exec() && p->blocked_on) {
>> + guard(raw_spinlock)(&p->blocked_lock);
>> + struct mutex *lock = p->blocked_on;
>> + if (lock) {
>> /*
>> - * When on_rq && !on_cpu the task is preempted, see if
>> - * it should preempt the task that is current now.
>> + * TASK_WAKING is a special state and results in
>> + * DEQUEUE_SPECIAL such that the task will actually be
>> + * forced from the runqueue.
>> */
>> - wakeup_preempt(rq, p, wake_flags);
>> + block_task(rq, p, TASK_WAKING);

This needs to reset rq->donor if the task being woken up is the
current donor.

>> + p->blocked_on = NULL;
>> + return 0;
>> }
>> - ttwu_do_wakeup(p);
>> - ret = 1;
>> }
>> - __task_rq_unlock(rq, p, &rf);
>>
>> - return ret;
>> + update_rq_clock(rq);

nit. Since block_task() now adds DEQUEUE_NOCLOCK, we need to move that
clock update before the block.

>> + if (p->se.sched_delayed)
>> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>
> I can't precisely remember the details now, but I believe we need to
> handle enqueueing sched_delayed tasks before handling blocked_on
> tasks.

So proxy_deactivate() can still delay the task, leading to
task_on_rq_queued() being true and the wakeup coming to
ttwu_runnable(), so either we dequeue it fully in proxy_deactivate() or
we teach block_task() to add a DEQUEUE_DELAYED flag when
task_is_blocked().

I think the former is cleaner, but then we don't decay lag for fair
tasks :-(

We can't simply re-enqueue it either, since proxy migration might have
put it on a CPU outside its affinity mask, so we need a full
dequeue + wakeup in ttwu_runnable().

>
>
>> + if (!task_on_cpu(rq, p)) {
>> + /*
>> + * When on_rq && !on_cpu the task is preempted, see if
>> + * it should preempt the task that is current now.
>> + */
>> + wakeup_preempt(rq, p, wake_flags);
>> + }
>> + ttwu_do_wakeup(p);
>> + return 1;
>> }
>>
>> void sched_ttwu_pending(void *arg)
> ...
>> @@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
>> return rq->idle;
>> }
>>
>> -static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
>> +static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
>> {
>> unsigned long state = READ_ONCE(donor->__state);
>>
>> - /* Don't deactivate if the state has been changed to TASK_RUNNING */
>> - if (state == TASK_RUNNING)
>> - return false;
>> + WARN_ON_ONCE(state == TASK_RUNNING);
>> +
>> /*
>> * Because we got donor from pick_next_task(), it is *crucial*
>> * that we call proxy_resched_idle() before we deactivate it.
>> @@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
>> * need to be changed from next *before* we deactivate.
>> */
>> proxy_resched_idle(rq);
>> - return try_to_block_task(rq, donor, &state, true);
>> + block_task(rq, donor, state);
>> }
>>
>> static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
>> @@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
>> proxy_reacquire_rq_lock(rq, rf);
>> }
>>
>> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
>> - struct task_struct *p)
>> - __must_hold(__rq_lockp(rq))
>> -{
>> - struct rq *task_rq, *target_rq = NULL;
>> - int cpu, wake_flag = WF_TTWU;
>> -
>> - lockdep_assert_rq_held(rq);
>> - WARN_ON(p == rq->curr);
>> -
>> - if (p == rq->donor)
>> - proxy_resched_idle(rq);
>> -
>> - proxy_release_rq_lock(rq, rf);
>> - /*
>> - * We drop the rq lock, and re-grab task_rq_lock to get
>> - * the pi_lock (needed for select_task_rq) as well.
>> - */
>> - scoped_guard (task_rq_lock, p) {
>> - task_rq = scope.rq;
>> -
>> - /*
>> - * Since we let go of the rq lock, the task may have been
>> - * woken or migrated to another rq before we got the
>> - * task_rq_lock. So re-check we're on the same RQ. If
>> - * not, the task has already been migrated and that CPU
>> - * will handle any futher migrations.
>> - */
>> - if (task_rq != rq)
>> - break;
>> -
>> - /*
>> - * Similarly, if we've been dequeued, someone else will
>> - * wake us
>> - */
>> - if (!task_on_rq_queued(p))
>> - break;
>> -
>> - /*
>> - * Since we should only be calling here from __schedule()
>> - * -> find_proxy_task(), no one else should have
>> - * assigned current out from under us. But check and warn
>> - * if we see this, then bail.
>> - */
>> - if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
>> - WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
>> - __func__, cpu_of(task_rq),
>> - p->comm, p->pid, p->on_cpu);
>> - break;
>> - }
>> -
>> - update_rq_clock(task_rq);
>> - deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
>> - cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
>> - set_task_cpu(p, cpu);
>> - target_rq = cpu_rq(cpu);
>> - clear_task_blocked_on(p, NULL);
>> - }
>> -
>> - if (target_rq)
>> - attach_one_task(target_rq, p);
>> -
>> - proxy_reacquire_rq_lock(rq, rf);
>> -}
>> -

Went a little heavy on the delete there, did you? :-)

>> /*
>> * Find runnable lock owner to proxy for mutex blocked donor
>> *
>> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
>> clear_task_blocked_on(p, PROXY_WAKING);
>> return p;
>> }
>> - goto force_return;
>> + goto deactivate;
>> }

This makes sense if we preserve the !TASK_RUNNING + p->blocked_on
invariant, since we'll definitely get a wakeup here.

>>
>> /*
>> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
>> __clear_task_blocked_on(p, NULL);
>> return p;
>> }
>> - goto force_return;
>> + goto deactivate;

This too makes sense, considering !owner implies some task will be
woken up, but ... if we take this task off and another task steals the
mutex, this task will no longer be able to proxy for it since it is
completely blocked now.

Probably not desired. We should at least let it run, see if it can get
the mutex, and evaluate "p->blocked_on" again, since !owner is a limbo
state.

>> }
>>
>> if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>> @@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
>> return owner;
>>
>> deactivate:
>> - if (proxy_deactivate(rq, donor))
>> - return NULL;
>> - /* If deactivate fails, force return */
>> - p = donor;
>> -force_return:
>> - proxy_force_return(rq, rf, p);
>> + proxy_deactivate(rq, donor);
>> return NULL;
>> migrate_task:
>> proxy_migrate_task(rq, rf, p, owner_cpu);
>
> So I like getting rid of proxy_force_return(), but its not clear to me
> that proxy_deactivate() is what we want to do in these
> find_proxy_task() edge cases.
>
> It feels like if we are already racing with ttwu, deactivating the
> task seems like it might open more windows where we might lose the
> wakeup.
>
> In fact, the whole reason we have proxy_force_return() is that earlier
> in the proxy-exec development, when we hit those edge cases we usually
> would return proxy_reschedule_idle() just to drop the rq lock and let
> ttwu do its thing, but there kept on being cases where we would end up
> with lost wakeups.
>
> But I'll give this a shot (and will integrate your ttwu_runnable
> cleanups regardless) and see how it does.

So I added the following on top of Peter's diff on top of
queue:sched/core, and it hasn't crashed and burnt yet when running a
handful of instances of sched-messaging with a mix of fair and SCHED_RR
priorities:

(Includes John's findings from the parallel thread)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b2b2451720a..e845e3a8ae65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
dequeue_task(rq, p, flags);
}

-static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
+static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
{
int flags = DEQUEUE_NOCLOCK;

@@ -3696,6 +3696,20 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
}
}

+static void zap_balance_callbacks(struct rq *rq);
+
+static inline void proxy_reset_donor(struct rq *rq)
+{
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ WARN_ON_ONCE(rq->curr == rq->donor);
+
+ put_prev_set_next_task(rq, rq->donor, rq->curr);
+ rq_set_donor(rq, rq->curr);
+ zap_balance_callbacks(rq);
+ resched_curr(rq);
+#endif
+}
+
/*
* Consider @p being inside a wait loop:
*
@@ -3730,6 +3744,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
return 0;

update_rq_clock(rq);
+ if (p->se.sched_delayed)
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
if (sched_proxy_exec() && p->blocked_on) {
guard(raw_spinlock)(&p->blocked_lock);
struct mutex *lock = p->blocked_on;
@@ -3738,15 +3754,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
* TASK_WAKING is a special state and results in
* DEQUEUE_SPECIAL such that the task will actually be
* forced from the runqueue.
+ *
+ * XXX: All of this is now equivalent of
+ * proxy_needs_return() from John's series :-)
*/
- block_task(rq, p, TASK_WAKING);
p->blocked_on = NULL;
+ if (task_current(rq, p))
+ goto out;
+ if (task_current_donor(rq, p))
+ proxy_reset_donor(rq);
+ block_task(rq, p, TASK_WAKING);
return 0;
}
}
-
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+out:
if (!task_on_cpu(rq, p)) {
/*
* When on_rq && !on_cpu the task is preempted, see if
@@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);

+ /*
+ * We never clear the blocked_on relation on proxy_deactivate.
+ * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
+ * when waking up. Since this is a fully blocked, off CPU task
+ * waking up, it should be safe to clear the blocked_on relation.
+ */
+ if (task_is_blocked(p))
+ clear_task_blocked_on(p, NULL);
+
cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
@@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
switch_count = &prev->nvcsw;
}

+ /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
+ if (!prev_state && task_is_blocked(prev))
+ clear_task_blocked_on(prev, NULL);
+
pick_again:
assert_balance_callbacks_empty(rq);
next = pick_next_task(rq, rq->donor, &rf);
---

Now, there are obviously some sharp edges, highlighted above, which may
affect performance and correctness to some extent, but once we have the
sleeping-owner bits, it should all go away.

Anyways, I'll let you bash me now on why that try_to_wake_up() hunk
might be totally wrong and dangerous :-)

--
Thanks and Regards,
Prateek