Re: [PATCH v29 7/9] sched: Add blocked_donor link to task for smarter mutex handoffs

From: K Prateek Nayak

Date: Tue May 19 2026 - 11:47:42 EST

Hello Peter,

On 5/19/2026 8:02 PM, Peter Zijlstra wrote:
> This then brings us to the consumer side of things:
>
>> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
>> index 09534628dc01a..0064b724ccda3 100644
>> --- a/kernel/locking/mutex.c
>> +++ b/kernel/locking/mutex.c
>> @@ -980,7 +980,7 @@ EXPORT_SYMBOL_GPL(ww_mutex_lock_interruptible);
>> static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned long ip)
>> __releases(lock)
>> {
>> - struct task_struct *next = NULL;
>> + struct task_struct *donor, *next = NULL;
>> struct mutex_waiter *waiter;
>> DEFINE_WAKE_Q(wake_q);
>> unsigned long owner;
>> @@ -1001,6 +1001,12 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
>> MUTEX_WARN_ON(__owner_task(owner) != current);
>> MUTEX_WARN_ON(owner & MUTEX_FLAG_PICKUP);
>>
>> + if (sched_proxy_exec() && current->blocked_donor) {
>> + /* force handoff if we have a blocked_donor */
>> + owner = MUTEX_FLAG_HANDOFF;
>> + break;
>> + }
>> +
>> if (owner & MUTEX_FLAG_HANDOFF)
>> break;
>>
>
> AFAICT this is racy since we don't have preemption disabled.
>
> So we can observe ->blocked_donor (A) set, or (B) unset.
> If (A) we can schedule() right after this (and before taking
> ->wait_lock) and it can be unset when we resume running this task. Or
> (B), the exact opposite.
>
> Now, (A) is harmless, because if ->blocked_donor becomes NULL, the
> hand off code falls back to picking the first on the wait list and
> things just get on.
>
> *However*, (B) might be a problem, because then we will not have the
> HANDOFF bit set even though there is in fact a donor we need to hand off
> to.

I don't think it is that big of a problem: __mutex_unlock_slowpath()
sees current->blocked_donor as NULL but just before it can do:

atomic_long_try_cmpxchg_release(&lock->owner, &owner, __owner_flags(owner))

it is preempted and comes back with a ->blocked_donor.

Once we grab the wait_lock, we realize there is a ->blocked_donor link
and we wake it up instead of first waiter which goes and grabs the
lock once it hits mutex_trylock() for !owner case.

If a concurrent mutex_lock() races with this sequence, the
->blocked_donor that was woken up sees the new owner and goes to proxy
that.

With guard(preempt)(), we see a stable ->blocked_donor as NULL, and we
wake up the first waiter after clearing the "owner" from lock->owner.

If a concurrent mutex_lock() races, it still grabs the mutex and now
it is just the first waiter that realizes it has to go proxy the task
that stole the lock.

One wasted wakeup is incurred in both cases but it seems like no
biggie. Maybe I don't understand the subtleties but stabilizing it
against preemption is definitely a good move since it saves this
debate :-)

> That is, we need this on top, no?
>
> ---
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1242,8 +1242,14 @@ struct task_struct {
> #endif
>
> struct mutex *blocked_on; /* lock we're blocked on */
> - struct task_struct *blocked_donor; /* task that is boosting this task */
> raw_spinlock_t blocked_lock;
> +
> + /*
> + * The task that is boosting this task; a back link for the current
> + * donor stack. Set in schedule() -> find_proxy_task() and only stable
> + * under preempt_disable().
> + */
> + struct task_struct *blocked_donor;
>
> #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
> /*
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -990,6 +990,14 @@ static noinline void __sched __mutex_unl
> __release(lock);
>
> /*
> + * Ensures the proxy donor stack is stable across unlock and handoff.
> + * Specifically, it avoids the case where current->blocked_donor is
> + * NULL when it is inspected while doing the unlock, but a preemption
> + * before taking the wake_lock would make it set and a hand-off is
> + * missed.
> + */
> + guard(preempt)();
> + /*
> * Release the lock before (potentially) taking the spinlock such that
> * other contenders can get on with things ASAP.
> *
> @@ -1063,7 +1071,8 @@ static noinline void __sched __mutex_unl
> __mutex_handoff(lock, next);
>
> raw_spin_unlock(&current->blocked_lock);
> - raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
> + raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
> + wake_up_q(&wake_q);

In that case, can't we simplify this to:

if (next)
wake_up_process(next);

and save on the wake_q enqueue dequeue overheads?

> }
>
> #ifndef CONFIG_DEBUG_LOCK_ALLOC

--
Thanks and Regards,
Prateek