Re: [PATCH 1/2] hung_task: Show the blocker task if the task is hung on mutex
From: Steven Rostedt
Date: Wed Feb 19 2025 - 11:23:46 EST
On Wed, 19 Feb 2025 22:00:49 +0900
"Masami Hiramatsu (Google)" <mhiramat@xxxxxxxxxx> wrote:
> From: Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
>
> The "hung_task" shows a long-time uninterruptible slept task, but most
> often, it's blocked on a mutex acquired by another task. Without
> dumping such a task, investigating the root cause of the hung task
> problem is very difficult.
>
> Fortunately CONFIG_DEBUG_MUTEXES=y allows us to identify the mutex
> blocking the task. And the mutex has "owner" information, which can
> be used to find the owner task and dump it with hung tasks.
>
> With this change, the hung task shows blocker task's info like below;
>
We've hit bugs like this in the field a few times, and they were very
difficult to debug. Something like this would have made our lives much
easier!
> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
> ---
> kernel/hung_task.c | 38 ++++++++++++++++++++++++++++++++++++++
> kernel/locking/mutex-debug.c | 1 +
> kernel/locking/mutex.c | 9 +++++++++
> kernel/locking/mutex.h | 6 ++++++
> 4 files changed, 54 insertions(+)
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 04efa7a6e69b..d1ce69504090 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -25,6 +25,8 @@
>
> #include <trace/events/sched.h>
>
> +#include "locking/mutex.h"
> +
> /*
> * The number of tasks checked:
> */
> @@ -93,6 +95,41 @@ static struct notifier_block panic_block = {
> .notifier_call = hung_task_panic,
> };
>
> +
> +#ifdef CONFIG_DEBUG_MUTEXES
> +static void debug_show_blocker(struct task_struct *task)
> +{
> + struct task_struct *g, *t;
> + unsigned long owner;
> + struct mutex *lock;
> +
> + if (!task->blocked_on)
> + return;
> +
> + lock = task->blocked_on->mutex;
This is a catch-22. To look at the task's blocked_on safely, we need the
lock->wait_lock held, otherwise the read can race. But to find that lock,
we need to look at the task's blocked_on field in the first place!

Another thing is that the waiter lives on the task's stack. Perhaps we need
to move this into sched/core.c so that it can lock the task's rq, because
even something like:

	waiter = READ_ONCE(task->blocked_on);

may end up pointing at garbage if the task were to suddenly wake up and run.
Now if we were able to lock the task's rq, which would prevent it from
being woken up, then the blocked_on field would not be at risk of being
corrupted.
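
To illustrate only the ordering I mean, here is a tiny single-threaded
userspace model. It is not kernel code and is untested against any real
tree: every name in it (model_task, model_try_wake(), model_find_blocker(),
the rq_locked flag standing in for the rq lock) is made up for illustration.
The point is just that the inspector pins the task first, and only then
follows blocked_on to the waiter on the sleeper's stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's real types. */
struct model_task;

struct model_mutex {
	struct model_task *owner;	/* task currently holding the mutex */
};

struct model_waiter {			/* would live on the sleeper's stack */
	struct model_mutex *mutex;
};

struct model_task {
	int pid;
	bool rq_locked;			/* stands in for the task's rq lock */
	struct model_waiter *blocked_on;
};

/* Wake-up path: cannot make progress while the "rq lock" is held. */
static bool model_try_wake(struct model_task *t)
{
	if (t->rq_locked)
		return false;		/* a real waker would wait here */
	t->blocked_on = NULL;		/* waiter is gone once the task runs */
	return true;
}

/* Inspector: pin the task first, only then dereference blocked_on. */
static struct model_task *model_find_blocker(struct model_task *t)
{
	struct model_task *owner = NULL;

	assert(!t->rq_locked);		/* the model's lock is not recursive */
	t->rq_locked = true;
	if (t->blocked_on && t->blocked_on->mutex)
		owner = t->blocked_on->mutex->owner;
	t->rq_locked = false;
	return owner;
}
```

While rq_locked is set, model_try_wake() fails and the waiter stays valid,
which is the property the hung task code would be relying on.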
-- Steve
> + if (unlikely(!lock)) {
> + pr_err("INFO: task %s:%d is blocked on a mutex, but the mutex is not found.\n",
> + task->comm, task->pid);
> + return;
> + }
> + owner = debug_mutex_get_owner(lock);
> + if (likely(owner)) {
> + /* Ensure the owner information is correct. */
> + for_each_process_thread(g, t)
> + if ((unsigned long)t == owner) {
> + pr_err("INFO: task %s:%d is blocked on a mutex owned by task %s:%d.\n",
> + task->comm, task->pid, t->comm, t->pid);
> + sched_show_task(t);
> + return;
> + }
> + }
> + pr_err("INFO: task %s:%d is blocked on a mutex, but the owner is not found.\n",
> + task->comm, task->pid);
> +}
> +#else
> +#define debug_show_blocker(t) do {} while (0)
> +#endif