Re: [RFC PATCH v15 2/7] locking/mutex: Rework task_struct::blocked_on
From: Steven Rostedt
Date: Thu Mar 13 2025 - 06:14:04 EST
FYI, this is useful for Masami's "hung task" work that will show what
tasks a hung task is blocked on in a crash report.
https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxx/
-- Steve
On Wed, 12 Mar 2025 15:11:32 -0700
John Stultz <jstultz@xxxxxxxxxx> wrote:
> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>
> Track the blocked-on relation for mutexes, to allow following this
> relation at schedule time.
>
> task
> | blocked-on
> v
> mutex
> | owner
> v
> task
>
> This all will be used for tracking blocked-task/mutex chains
> with the proxy-execution patch in a similar fashion to how
> priority inheritance is done with rt_mutexes.
>
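> For illustration only (not part of this patch), following the
> chain at schedule time could look roughly like the sketch below,
> where find_proxy_task() is a hypothetical helper name,
> __mutex_owner() is the mutex.c-internal owner accessor, and all
> locking is elided:
>
>     static struct task_struct *find_proxy_task(struct task_struct *p)
>     {
>             struct mutex *m;
>             struct task_struct *owner;
>
>             /* Follow task -> blocked_on mutex -> owner task until
>              * we reach a task that is not blocked on a mutex. */
>             while ((m = p->blocked_on)) {
>                     owner = __mutex_owner(m);
>                     if (!owner)     /* lock was just released */
>                             break;
>                     p = owner;
>             }
>             return p;
>     }
>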
> For serialization, blocked-on is only set by the task itself
> (current). Both setting and clearing (the latter potentially
> done by others) are done while holding the mutex::wait_lock.
>
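> As a purely illustrative sketch of that discipline (the hunks
> below are authoritative):
>
>     /* Waiter side: only current may mark itself blocked. */
>     raw_spin_lock_irqsave(&lock->wait_lock, flags);
>     WARN_ON(current->blocked_on);
>     current->blocked_on = lock;
>     raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>
>     /* Unlock side: others may clear a waiter's pointer, but
>      * only while holding the same mutex::wait_lock. */
>     raw_spin_lock_irqsave(&lock->wait_lock, flags);
>     WARN_ON(next->blocked_on && next->blocked_on != lock);
>     next->blocked_on = NULL;
>     raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
>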
> Cc: Joel Fernandes <joelagnelf@xxxxxxxxxx>
> Cc: Qais Yousef <qyousef@xxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Ben Segall <bsegall@xxxxxxxxxx>
> Cc: Zimuzo Ezeozue <zezeozue@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>
> Cc: Waiman Long <longman@xxxxxxxxxx>
> Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
> Cc: Metin Kaya <Metin.Kaya@xxxxxxx>
> Cc: Xuewen Yan <xuewen.yan94@xxxxxxxxx>
> Cc: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
> Cc: Suleiman Souhlal <suleiman@xxxxxxxxxx>
> Cc: kernel-team@xxxxxxxxxxx
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> [minor changes while rebasing]
> Signed-off-by: Juri Lelli <juri.lelli@xxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Connor O'Brien <connoro@xxxxxxxxxx>
> [jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
> Signed-off-by: John Stultz <jstultz@xxxxxxxxxx>
> ---
> v2:
> * Fixed blocked_on tracking in error paths that was causing crashes
> v4:
> * Ensure we clear blocked_on when waking ww_mutexes to die or wound.
> This is critical so we don't get circular blocked_on relationships
> that can't be resolved.
> v5:
> * Fix potential bug where the skip_wait path might clear blocked_on
> when that path never set it
> * Slight tweaks to where we set blocked_on to make it consistent,
> along with extra WARN_ON correctness checking
> * Minor comment changes
> v7:
> * Minor commit message change suggested by Metin Kaya
> * Fix WARN_ON conditionals in unlock path (as blocked_on might already
> be cleared), found while looking at issue Metin Kaya raised.
> * Minor tweaks to be consistent in what we do under the
> blocked_on lock, also tweaked variable name to avoid confusion
> with label, and comment typos, as suggested by Metin Kaya
> * Minor tweak for CONFIG_SCHED_PROXY_EXEC name change
> * Moved unused block of code to later in the series, as suggested
> by Metin Kaya
> * Switch to a tri-state to be able to distinguish between waking and
> runnable so we can later safely do return migration from ttwu
> * Folded together with related blocked_on changes
> v8:
> * Fix issue leaving task BO_BLOCKED when calling into optimistic
> spinning path.
> * Include helper to better handle BO_BLOCKED->BO_WAKING transitions
> v9:
> * Typo fixup pointed out by Metin
> * Cleanup BO_WAKING->BO_RUNNABLE transitions for the !proxy case
> * Many cleanups and simplifications suggested by Metin
> v11:
> * Whitespace fixup pointed out by Metin
> v13:
> * Refactor set_blocked_on helpers to clean things up a bit
> v14:
> * Small build fixup with PREEMPT_RT
> v15:
> * Improve consistency of names for functions that assume blocked_lock
> is held, as suggested by Peter
> * Use guard instead of separate spinlock/unlock calls, also suggested
> by Peter
> * Drop blocked_on_state tri-state for now, as it's not needed until
> later in the series, when we get to proxy-migration and return-
> migration.
> ---
> include/linux/sched.h | 5 +----
> kernel/fork.c | 3 +--
> kernel/locking/mutex-debug.c | 9 +++++----
> kernel/locking/mutex.c | 19 +++++++++++++++++++
> kernel/locking/ww_mutex.h | 18 ++++++++++++++++--
> 5 files changed, 42 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1462f2c70aefc..03775c44b7073 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1212,10 +1212,7 @@ struct task_struct {
> struct rt_mutex_waiter *pi_blocked_on;
> #endif
>
> -#ifdef CONFIG_DEBUG_MUTEXES
> - /* Mutex deadlock detection: */
> - struct mutex_waiter *blocked_on;
> -#endif
> + struct mutex *blocked_on; /* lock we're blocked on */
>
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> int non_block_count;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 735405a9c5f32..38f055082d07a 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2357,9 +2357,8 @@ __latent_entropy struct task_struct *copy_process(
> lockdep_init_task(p);
> #endif
>
> -#ifdef CONFIG_DEBUG_MUTEXES
> p->blocked_on = NULL; /* not blocked yet */
> -#endif
> +
> #ifdef CONFIG_BCACHE
> p->sequential_io = 0;
> p->sequential_io_avg = 0;
> diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
> index 6e6f6071cfa27..758b7a6792b0c 100644
> --- a/kernel/locking/mutex-debug.c
> +++ b/kernel/locking/mutex-debug.c
> @@ -53,17 +53,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
> {
> lockdep_assert_held(&lock->wait_lock);
>
> - /* Mark the current thread as blocked on the lock: */
> - task->blocked_on = waiter;
> + /* Current thread can't be already blocked (since it's executing!) */
> + DEBUG_LOCKS_WARN_ON(task->blocked_on);
> }
>
> void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
> struct task_struct *task)
> {
> + struct mutex *blocked_on = READ_ONCE(task->blocked_on);
> +
> DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
> DEBUG_LOCKS_WARN_ON(waiter->task != task);
> - DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
> - task->blocked_on = NULL;
> + DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
>
> INIT_LIST_HEAD(&waiter->list);
> waiter->task = NULL;
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index b36f23de48f1b..37d1966970617 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -627,6 +627,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> goto err_early_kill;
> }
>
> + WARN_ON(current->blocked_on);
> + current->blocked_on = lock;
> set_current_state(state);
> trace_contention_begin(lock, LCB_F_MUTEX);
> for (;;) {
> @@ -663,6 +665,12 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
>
> first = __mutex_waiter_is_first(lock, &waiter);
>
> + /*
> + * As we likely have been woken up by a task
> + * that has cleared our blocked_on state, re-set
> + * it to the lock we are trying to acquire.
> + */
> + current->blocked_on = lock;
> set_current_state(state);
> /*
> * Here we order against unlock; we must either see it change
> @@ -683,6 +691,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> }
> raw_spin_lock_irqsave(&lock->wait_lock, flags);
> acquired:
> + current->blocked_on = NULL;
> __set_current_state(TASK_RUNNING);
>
> if (ww_ctx) {
> @@ -712,9 +721,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
> return 0;
>
> err:
> + current->blocked_on = NULL;
> __set_current_state(TASK_RUNNING);
> __mutex_remove_waiter(lock, &waiter);
> err_early_kill:
> + WARN_ON(current->blocked_on);
> trace_contention_end(lock, ret);
> raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
> debug_mutex_free_waiter(&waiter);
> @@ -924,6 +935,14 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
> next = waiter->task;
>
> debug_mutex_wake_waiter(lock, waiter);
> + /*
> + * Unlock wakeups can be happening in parallel
> + * (when optimistic spinners steal and release
> + * the lock), so blocked_on may already be
> + * cleared here.
> + */
> + WARN_ON(next->blocked_on && next->blocked_on != lock);
> + next->blocked_on = NULL;
> wake_q_add(&wake_q, next);
> }
>
> diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
> index 37f025a096c9d..00db40946328e 100644
> --- a/kernel/locking/ww_mutex.h
> +++ b/kernel/locking/ww_mutex.h
> @@ -284,6 +284,14 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
> #ifndef WW_RT
> debug_mutex_wake_waiter(lock, waiter);
> #endif
> + /*
> + * When waking up the task to die, be sure to clear the
> + * blocked_on pointer. Otherwise we can see circular
> + * blocked_on relationships that can't be resolved.
> + */
> + WARN_ON(waiter->task->blocked_on &&
> + waiter->task->blocked_on != lock);
> + waiter->task->blocked_on = NULL;
> wake_q_add(wake_q, waiter->task);
> }
>
> @@ -331,9 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
> * it's wounded in __ww_mutex_check_kill() or has a
> * wakeup pending to re-read the wounded state.
> */
> - if (owner != current)
> + if (owner != current) {
> + /*
> + * When waking up the task to wound, be sure to clear the
> + * blocked_on pointer. Otherwise we can see circular
> + * blocked_on relationships that can't be resolved.
> + */
> + owner->blocked_on = NULL;
> wake_q_add(wake_q, owner);
> -
> + }
> return true;
> }
>