Re: [PATCH v24 09/11] sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case

From: K Prateek Nayak

Date: Wed Mar 11 2026 - 06:07:03 EST


Hello John,

On 3/11/2026 12:15 PM, John Stultz wrote:
> On Thu, Jan 1, 2026 at 1:53 AM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>> On 11/25/2025 4:01 AM, John Stultz wrote:
>>> @@ -4197,8 +4260,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>> */
>>> scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
>>> smp_mb__after_spinlock();
>>> - if (!ttwu_state_match(p, state, &success))
>>> - break;
>>> + if (!ttwu_state_match(p, state, &success)) {
>>> + /*
>>> + * If we're already TASK_RUNNING, and PROXY_WAKING
>>> + * continue on to ttwu_runnable check to force
>>> + * proxy_needs_return evaluation
>>> + */
>>> + if (!proxy_task_runnable_but_waking(p))
>>> + break;
>>> + }
>>
>> I don't like the fact that we have to go against the ttwu_state_match()
>> machinery here or the fact that proxy_deactivate() has to check for
>> "TASK_RUNNING" as a short-cut to detect wakeups. It makes all these bits
>> a bit more painful to maintain IMO.
>>
>> Here is what I've learnt so far:
>>
>> o Task can go into an interruptible sleep where a signal can cause the
>> task to give up its attempt to acquire the lock ("err" path in
>> __mutex_lock_common()) - This means we'll have to re-evaluate the
>> "p->blocked_on" by running it.
>>
>> o If we weren't running with PROXY_EXEC, a task that was blocked
>> (including delayed) on a mutex will definitely receive a wakeup. With
>> proxy, the only difference is that the wakeup path will now see
>> "task_on_rq_queued()" and will have to go through ttwu_runnable() under
>> rq_lock.
>>
>> o Everything else is handled in __schedule() under the rq_lock again
>> and all paths that drop the rq_lock will go through a re-pick.
>>
>> o Once we deactivate a donor (via proxy_deactivate()), the "blocked_on"
>> relation has no real use since we'll always be PROXY_WAKING when we
>> come back (or we have received a signal and that blocked_on relation
>> now needs a re-evaluation anyways)
>>
>> I'm sure a bunch of these hold up for RWSEM too.
>>
>> With that in mind, we can actually generalize ttwu_runnable() and
>> proxy_needs_return() to clear "p->blocked_on", even for !PROXY_WAKING
>> since a wakeup implies we have to re-evaluate the "p->blocked_on".
>>
>> We already have clear_task_blocked_on() within the rq_lock critical
>> section (except for when task is running) and we ensure blocked donors
>> cannot go TASK_RUNNING with task_is_blocked() != NULL.
>>
>>
>> This is what I had in mind based on top of commit d424d28ea93
>> ("sched: Migrate whole chain in proxy_migrate_task()") on
>> "proxy-exec-v24-6.18-rc6" branch:
>>
>> (lightly tested with sched-messaging; Haven't seen any splats
>> or hung-task yet but from my experience, it doesn't hold any
>> water when my suggestion meets the real-world :-)
> ...
>>
>> I guess ignorance is bliss and all this might be a terrible idea but
>> I like how proxy_wake_up_donor() (aka proxy_needs_return()) ends up
>> being much simpler now and the whole:
>>
>> blocked_donor (!TASK_RUNNING && task_is_blocked())
>> |
>> v
>> wakeup (clear_task_blocked_on() + ttwu())
>> |
>> v
>> runnable (TASK_RUNNING && !task_is_blocked())
>>
>> is easier to reason about IMO and should also help sched-ext since we
>> now have defined points to notify where the proxy starts and ends for
>> them to hook onto.
>>
>> I think I lost the "wake_cpu" check in proxy_needs_return() somewhere
>> down the line but doing a block + wakeup irrespective seems to be the
>> right thing since in absence of proxy, !task_current() would have
>> blocked and received a wakeup anyways.
>>
>> Sorry for dumping a bunch of radical suggestions during the holidays.
>> Hope we can flush out a bunch more of proxy execution (if not all of
>> it) this year :-)
>
> Apologies, I've been dragging my feet a bit on this email, as it's a
> fair amount of change to get my head around. :)

Sorry about that! The timing too was not ideal given the holiday
season! Thank you again for taking a look at it.

> As always, I *really* appreciate your thoughtful feedback.

Likewise :-)

>
> Running with your patch, I do get quite a few splats from the
> assert_balance_callbacks_empty() check (which then later causes a
> deadlock from lockdep on the console lock), but that seems to be due
> to proxy_needs_return()'s calling proxy_reset_donor() which does a
> put_prev_set_next() which might enqueue a balance callback that we
> then don't run when we drop the rq lock in ttwu_runnable(). Zapping
> the callback seems to resolve this.

I clearly didn't test it enough! But zapping seems to be the right
thing to do since we are on our way to a re-pick anyways, and the
re-pick will queue those callbacks again.

>
> From there, I've been trying to separate and understand the different
> threads of change and their dependencies so that I can apply them to
> the right point in the series, but it's subtle and the changes do seem
> very inter-dependent.

Yes, sorry about that! I should have probably given a tree with the
breakdown.

>
> The main semantic rework of task state and blocked_on is interesting.
> I definitely like the idea of simplifying the ttwu paths. But the
> change also feels a little messy to me, so I'm not 100% on adopting
> it yet, but maybe I need to just get more familiar with it or combine
> the logic in some form.

So my main gripe was the position of proxy_task_runnable_but_waking()
and the TASK_RUNNING check in the proxy_deactivate() path: if we fail
ttwu_state_match(), we don't need a wakeup, but with proxy we get the
additional TASK_RUNNING + PROXY_WAKING handling, which seems out of
place.

TASK_RUNNING + PROXY_WAKING means the task is queued on the CPU and
hasn't blocked yet, i.e. it is either running or preempted:

o If it is preempted, find_proxy_task() will realize PROXY_WAKING and
evaluate a NEEDS_RETURN.

o If it is running, it will simply not block but will still have
PROXY_WAKING set, which find_proxy_task() should again handle: it'll
either return curr on seeing PROXY_WAKING, or we'll end up in the
above case of being preempted with PROXY_WAKING.

Now that I think of it, why exactly did we need it? My memory is a
little hazy on this matter.

>
> The suggestion on donor wakeup to just set rq->donor to rq->curr
> sounds really nice. I've been thinking of doing similar but was
> hesitant as the donor change has already been a source of races and
> I've been wary to rock the boat, but I'd like to pick this up.

Based on my reading, it is all serialized by the rq_lock, so the worst
case scenario is RCU readers seeing a stale rq->donor for a bit, but
that can happen today anyways so it must already be handled.

> However
> the put_prev_set_next_task() on something other than idle has numerous
> subtle side effects that are giving me grief (seem to be hitting a
> case where right as rq->curr is dequeued and instead set as
> sched_delayed, we wake the donor from another cpu, tripping warnings
> trying to set rq->curr as next when it's sched_delayed).

Is it the WARN_ON() in __set_next_task_fair()? If so, it can happen
like this:

CPU0: schedule()                              CPU1: handoff
================                              =============
rq_lock(CPU0)
try_to_block_task(rq->curr) /* Delayed */
donor = pick_next_task()
set_rq_donor(donor) /* Donor has blocked_on */
                                              set_task_blocked_on_waking(donor)
find_proxy_task()                             wake_up_task(donor)
  if (mutex == PROXY_WAKING)                    try_to_wake_up()
    action = NEEDS_RETURN;                        pi_lock(donor)
    break                                         ...
NEEDS_RETURN:                                     ttwu_runnable()
  proxy_force_return(donor)                         rq_lock(rq(CPU0)) ...
    raw_spin_rq_unlock(rq(CPU0))                    ... /* Gets the rq_lock */
    pi_lock(donor) ...                              proxy_needs_return()
                                                      proxy_reset_donor(rq(CPU0)->curr)
                                                        set_next_task_fair(curr)
                                                        !!! Whoops! rq->curr is delayed! !!!


We set the donor before the proxy pick, but curr is only modified
later once we finalize the task, so it can hit this (or I'm missing
something, as usual).

I think we are missing a proxy_resched_idle() in the find_proxy_task()
loop when we plan to return-migrate the donor before dropping the rq
lock. I see a proxy_resched_idle() in proxy_force_return(), but that
happens after task_rq_lock(), which is too late.

>
> And I definitely like the reworked proxy_force_return() suggestion,
> which is much closer to the wake_up_process() approach I used back in
> v22:
> https://lore.kernel.org/lkml/20250926032931.27663-5-jstultz@xxxxxxxxxx/
> But unfortunately my attempts to use a similar method ran into lots of
> trouble while I was reworking the logic for v23, which resulted in the
> messier logic currently in that function.
>
> So I've had difficulty breaking up and isolating the minimal changeset
> to support these simplifications, as they seem to also depend on the
> semantic rework around task state and blocked_on that you've made.
>
> Finally, I appreciate your thorough comments, though I'm wary of
> adding more spots where we're accessing values without locks. I know
> these tricks can be critical for performance, but the scheduler code
> is already so terribly subtle with its many exceptions to locking
> rules that I fret this makes it even harder to think about (though I'm
> maybe overly sensitive here as I've felt particularly dim in my
> efforts to rework your patch :).

Perhaps we can work through these step by step later, given the series
in its current state has had a good bit of testing. I only saw the WIP
branch a few days back, and I do think some of this gets cleaned away
later, so I'm not against keeping it like this for now.

>
> So I'm a little stuck as I really want to get the next iteration (v25)
> of the patch series out to the list, but I don't want to be ignoring
> your suggestions here. Since the v25 series has added a number of
> earlier fixups, the donor-migration chunk is getting quite long (16
> patches currently), so I may just send the early portions (up to
> "Handle blocked-waiter migration (and return migration)" - before we
> get to this try_to_wake_up() change), and maybe I can hopefully sort
> out how to better integrate this feedback for v26 (or whenever the
> first chunk is queued).
>
> That sound ok?

I'm okay with that :-) Again thanks a ton for looking into this.

--
Thanks and Regards,
Prateek