Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
From: K Prateek Nayak
Date: Fri Apr 03 2026 - 11:42:02 EST
Hello Peter,
On 4/3/2026 8:08 PM, Peter Zijlstra wrote:
> On Fri, Apr 03, 2026 at 07:13:29PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> On 4/3/2026 4:58 PM, Peter Zijlstra wrote:
>>> On Fri, Apr 03, 2026 at 03:55:22PM +0530, K Prateek Nayak wrote:
>>>>>> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>>>>> */
>>>>>> smp_cond_load_acquire(&p->on_cpu, !VAL);
>>>>>>
>>>>>> + /*
>>>>>> + * We never clear the blocked_on relation on proxy_deactivate.
>>>>>> + * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
>>>>>> + * when waking up. Since this is a fully blocked, off CPU task
>>>>>> + * waking up, it should be safe to clear the blocked_on relation.
>>>>>> + */
>>>>>> + if (task_is_blocked(p))
>>>>>> + clear_task_blocked_on(p, NULL);
>>>>>> +
>>>>>
>>>>> Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
>>>>> skip ttwu_runnable(). We still need to clear ->blocked_on.
>>>
>>> I wonder, should we have proxy_deactivate() do this instead?
>>
>> That is one way to tackle that, yes!
>
> OK, let's put it there. At that site we already know task_is_blocked()
> and we get less noise in the wakeup path.
>
> Or should we perhaps put it in block_task() itself? The moment you're
> off the runqueue, ->blocked_on becomes meaningless.
Ack, but I'll have to point you to these next bits in John's tree that
handle the sleeping-owner case:
https://github.com/johnstultz-work/linux-dev/commit/255c9e933edf5b86e29f9fbde67738fc5041a862
Essentially, going further, when the blocked_on chain encounters a
sleeping owner, the tasks in the chain block themselves and attach to
the sleeping owner; when the owner wakes up, the whole chain is
activated in one go, restoring proxy execution.
This is why John has suggested that block_task() is probably not the
right place to clear it: for the sleeping-owner bits, we need to
preserve the blocked_on relation until ttwu().
I have some ideas but let me first see if I can stop them from
exploding on my system :-)
>
>>>>>
>>>>>> cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
>>>>>> if (task_cpu(p) != cpu) {
>>>>>> if (p->in_iowait) {
>>>>>> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
>>>>>> switch_count = &prev->nvcsw;
>>>>>> }
>>>>>>
>>>>>> + /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
>>>>>> + if (!prev_state && task_is_blocked(prev))
>>>>>> + clear_task_blocked_on(prev, NULL);
>>>>>> +
>>>>>
>>>>> This one confuses me, ttwu() should never result in ->blocked_on being
>>>>> set.
>>>>
>>>> This is from the signal_pending_state() in try_to_block_task() putting
>>>> prev to TASK_RUNNING while still having p->blocked_on.
>>>
>>> Ooh, I misread that. I saw that set_task_blocked_on_waking(, NULL) in
>>> there and thought it would clear. Damn, sometimes reading is so very
>>> hard...
>>>
>>> Anyhow, with my changes on, try_to_block_task() is only ever called from
>>> __schedule() on current. This means this could in fact be
>>> clear_task_blocked_on(), right?
>>
>> Yes, that should work too but there is also the case you mentioned on
>> the parallel thread - if ttwu() doesn't see p->blocked_on, it won't
>> clear it and if ttwu_runnable() wins we can simply go into
>> __schedule() with TASK_RUNNING + p->blocked_on with no guarantee that
>> same task will be picked again. Then the task gets preempted and stays
>> in an illegal state.
>>
>> Clearing of ->blocked_on in schedule also safeguards against that race:
>>
>> o Scenario 1: ttwu_runnable() wins
>>
>> CPU0 CPU1
>>
>> LOCK ACQUIRE
>> [W] ->blocked_on = lock [R] ->__state
>> [W] ->__state = state; RMB
>> [R] ->blocked_on
>> [W] ->__state = RUNNING
>>
>> /* Synchronized by rq_lock */
>>
>> MB
>> [R] if (!->__state &&
>> [R] ->blocked_on)
>> [W] ->blocked_on = NULL; /* Safeguard */
>>
>>
>> o Scenario 2: __schedule() wins
>>
>> CPU0 CPU1
>>
>> LOCK ACQUIRE
>> [W] ->blocked_on = lock [R] ->__state
>> [W] ->__state = state;
>>
>> /* Synchronized by rq_lock */
>> MB
>> [W] __block_task()
>>
>> MB
>> /* Full wakeup. */
>> [R] if (->blocked_on)
>> [W] ->blocked_on = NULL
>>
>
> Oh Boohoo :-( Yes, you're quite right.
>
> It is find_proxy_task() that is affected, so this fixup should
> perhaps live in the existing sched_proxy_exec() branch?
>
> Something like so?
>
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2222,6 +2222,13 @@ static bool block_task(struct rq *rq, st
> {
> int flags = DEQUEUE_NOCLOCK;
>
> + /*
> + * We're being taken off the runqueue, cannot still be blocked_on
> + * anything. This also means that delay_dequeue can not have
> + * blocked_on.
> + */
> + clear_task_blocked_on(p, NULL);
> +
Oh! I remember why we couldn't do this: if we clear
"blocked_on" via proxy_deactivate(), the task might be left delayed on
the owner's CPU, *outside* the task's affinity.
In that case, we don't want to re-enqueue it there since that would
violate affinity; keeping the "blocked_on" relation forces it down
the sched_proxy_exec() path in ttwu_runnable(), which fixes it via a
full wakeup.
> p->sched_contributes_to_load =
> (task_state & TASK_UNINTERRUPTIBLE) &&
> !(task_state & TASK_NOLOAD) &&
> @@ -6614,7 +6621,7 @@ static bool try_to_block_task(struct rq
> if (signal_pending_state(task_state, p)) {
> WRITE_ONCE(p->__state, TASK_RUNNING);
> *task_state_p = TASK_RUNNING;
> - set_task_blocked_on_waking(p, NULL);
> + clear_task_blocked_on(p, NULL);
This is not strictly required, since we set "*task_state_p" to
TASK_RUNNING, which modifies "prev_state" in __schedule() ...
>
> return false;
> }
> @@ -7043,6 +7050,14 @@ static void __sched notrace __schedule(i
> if (sched_proxy_exec()) {
> struct task_struct *prev_donor = rq->donor;
>
> + /*
> + * There is a race between ttwu() and __mutex_lock_common()
> + * where it is possible for the mutex code to call into
> + * schedule() with ->blocked_on still set.
> + */
> + if (!prev_state && prev->blocked_on)
> + clear_task_blocked_on(prev, NULL);
... we'll just end up seeing !prev_state here and clearing it.
We can keep them separate too to make it very explicit. No
strong feelings.
> +
> rq_set_donor(rq, next);
> if (unlikely(next->blocked_on)) {
> next = find_proxy_task(rq, next, &rf);
>
--
Thanks and Regards,
Prateek