Re: [PATCH v4 1/2] hung_task: Show the blocker task if the task is hung on mutex

From: Lance Yang
Date: Wed Jul 30 2025 - 05:25:05 EST




On 2025/7/30 15:59, Sergey Senozhatsky wrote:
On (25/02/25 16:02), Masami Hiramatsu (Google) wrote:
The "hung_task" shows a long-time uninterruptible slept task, but most
often, it's blocked on a mutex acquired by another task. Without
dumping such a task, investigating the root cause of the hung task
problem is very difficult.

This introduces task_struct::blocker_mutex to point to the mutex lock which this task is waiting for. Since the mutex has "owner" information, we can find the owner task and dump it along with the hung task.

Note: the owner can change while we are dumping the owner task, so this is only "likely" the owner of the mutex.

With this change, the hung task report shows the blocker task's info like below:

INFO: task cat:115 blocked for more than 122 seconds.
Not tainted 6.14.0-rc3-00003-ga8946be3de00 #156
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:cat state:D stack:13432 pid:115 tgid:115 ppid:106 task_flags:0x400100 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x731/0x960
? schedule_preempt_disabled+0x54/0xa0
schedule+0xb7/0x140
? __mutex_lock+0x51b/0xa60
? __mutex_lock+0x51b/0xa60
schedule_preempt_disabled+0x54/0xa0
__mutex_lock+0x51b/0xa60
read_dummy+0x23/0x70
full_proxy_read+0x6a/0xc0
vfs_read+0xc2/0x340
? __pfx_direct_file_splice_eof+0x10/0x10
? do_sendfile+0x1bd/0x2e0
ksys_read+0x76/0xe0
do_syscall_64+0xe3/0x1c0
? exc_page_fault+0xa9/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x4840cd
RSP: 002b:00007ffe99071828 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd
RDX: 0000000000001000 RSI: 00007ffe99071870 RDI: 0000000000000003
RBP: 00007ffe99071870 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
R13: 00000000132fd3a0 R14: 0000000000000001 R15: ffffffffffffffff
</TASK>
INFO: task cat:115 is blocked on a mutex likely owned by task cat:114.
task:cat state:S stack:13432 pid:114 tgid:114 ppid:106 task_flags:0x400100 flags:0x00000002
Call Trace:
<TASK>
__schedule+0x731/0x960
? schedule_timeout+0xa8/0x120
schedule+0xb7/0x140
schedule_timeout+0xa8/0x120
? __pfx_process_timeout+0x10/0x10
msleep_interruptible+0x3e/0x60
read_dummy+0x2d/0x70
full_proxy_read+0x6a/0xc0
vfs_read+0xc2/0x340
? __pfx_direct_file_splice_eof+0x10/0x10
? do_sendfile+0x1bd/0x2e0
ksys_read+0x76/0xe0
do_syscall_64+0xe3/0x1c0
? exc_page_fault+0xa9/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x4840cd
RSP: 002b:00007ffe3e0147b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004840cd
RDX: 0000000000001000 RSI: 00007ffe3e014800 RDI: 0000000000000003
RBP: 00007ffe3e014800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000001000
R13: 000000001a0a93a0 R14: 0000000000000001 R15: ffffffffffffffff
</TASK>

One thing that gives me a bit of "inconvenience" is that in certain
cases this significantly increases the number of stack traces to go
through. A distilled real-life example:
- task T1 acquires lock L1, attempts to acquire L2
- task T2 acquires lock L2, attempts to acquire L3
- task T3 acquires lock L3, attempts to acquire L1

So we'd now see:
- a backtrace of T1, followed by a backtrace of T2 (owner of L2)
- a backtrace of T2, followed by a backtrace of T3 (owner of L3)
- a backtrace of T3, followed by a backtrace of T1 (owner of L1)

Notice how each task is backtraced twice. I wonder if it's worth it
to de-dup the backtraces. E.g. in

task cat:115 is blocked on a mutex likely owned by task cat:114

if we know that cat:114 is also blocked on a lock, then we probably
can just say "is blocked on a mutex likely owned by task cat:114" and
continue iterating through tasks. That "cat:114" will be backtraced
individually later, as it's also blocked on a lock, owned by another
task.

Does this make any sense?

Good spot! There is room for improvement.

In a deadlock chain like T1->T2->T3->T1, by definition, T1, T2, and T3
are all hung tasks, and the detector's primary responsibility is to
generate a report for each of them. The current implementation, when
reporting on one task, also dumps the backtrace of its blocker.

This results in a task's backtrace being printed twice — once as a
blocker and again as a primary hung task.

Regarding the de-duplication idea: it is elegant, but it does introduce
more complexity into the detector. We should also consider that in many
real-world cases, the blocking chain is just one level deep, where this
isn't an issue, IMHO ;)
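
If we did go down that road, the core change could probably stay small. Very roughly, something like the helper below (just a sketch, not patch code): skip the owner's backtrace when the owner is itself blocked on a mutex, since it will get its own hung-task report with a full backtrace.

	/* sketch: decide whether the blocker's backtrace is worth printing */
	static bool blocker_needs_backtrace(struct task_struct *owner)
	{
		/*
		 * If the owner is itself waiting on a mutex, it will be
		 * reported as a hung task in its own right, so the one-line
		 * "likely owned by ..." message is enough for it here.
		 */
		return !READ_ONCE(owner->blocker_mutex);
	}

The subtle part is an owner that is blocked but has not (or never will) cross the hung_task timeout itself: skipping its dump here could mean we never see its backtrace at all. That's the kind of corner case I had in mind with "more complexity".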

Thanks,
Lance