Re: [PATCH v3] hung_task: deduplicate identical hang reports
From: Google
Date: Sun Jun 21 2026 - 20:19:12 EST
On Sun, 21 Jun 2026 17:37:56 -0400
Aaron Tomlin <atomlin@xxxxxxxxxxx> wrote:
> Currently, during severe lock contention, multiple tasks can hang while
> waiting on the exact same resource. The khungtaskd kthread
> indiscriminately reports every single instance with a stack trace.
> This can roll the kernel ring buffer and prematurely exhaust the
> kernel.hung_task_warnings budget. Consequently, the kernel is left
> entirely blind to subsequent, unrelated deadlocks.
>
> To preserve the warning budget and ring buffer without sacrificing
> observability, introduce a Wait Channel (wchan) and task-state based
> deduplicator:
>
> 1. Implement a lightweight, stack-allocated 64-slot Wait Channel
> (wchan) hash map. Tasks blocked on the exact same wchan during a
> single scan are recognised as sharing the same bottleneck,
> successfully deduplicating contentions even when the callers
> possess entirely disparate call stacks.
Hmm, wouldn't this essentially suppress most of what we would normally
expect to see when a standard lock is contended?
Ideally, we would sort the waiters by the time each one first blocked
on the lock and display only the oldest stack.
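Below is a minimal userspace sketch of the "oldest stack only" idea. It assumes that each hung task's last_switch_time approximates when it first blocked. struct hung_waiter, pick_oldest_per_wchan() and earlier() are hypothetical names used only for illustration. The patch overwrites a slot on collision; this sketch uses linear probing instead, so two different wchans that share a slot are both kept.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a hung task (illustration only). */
struct hung_waiter {
	int pid;
	unsigned long wchan;            /* 0 if get_wchan() found nothing */
	unsigned long last_switch_time; /* jiffies at last context switch */
};

#define WCHAN_SLOTS 64

/* Wraparound-safe "a is before b", in the spirit of time_before(). */
static int earlier(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Set show[i] = 1 only for the longest-blocked waiter on each wchan.
 * Waiters with no wchan are always shown. Linear probing keeps distinct
 * wchans that collide; if the table fills, extra waiters are shown
 * rather than suppressed.
 */
static void pick_oldest_per_wchan(const struct hung_waiter *w, size_t n,
				  int *show)
{
	unsigned long key[WCHAN_SLOTS] = { 0 };
	size_t oldest[WCHAN_SLOTS] = { 0 };
	size_t i, probe, s;

	for (i = 0; i < n; i++) {
		show[i] = 0;
		if (!w[i].wchan) {
			show[i] = 1;
			continue;
		}
		for (probe = 0; probe < WCHAN_SLOTS; probe++) {
			/* A kernel version would start from hash_long(wchan, 6). */
			s = (w[i].wchan + probe) % WCHAN_SLOTS;
			if (!key[s]) {
				key[s] = w[i].wchan;
				oldest[s] = i;
				break;
			}
			if (key[s] == w[i].wchan) {
				if (earlier(w[i].last_switch_time,
					    w[oldest[s]].last_switch_time))
					oldest[s] = i;
				break;
			}
		}
		if (probe == WCHAN_SLOTS)
			show[i] = 1;	/* table full: fail open */
	}

	/* Second pass: un-suppress the oldest waiter for each wchan. */
	for (s = 0; s < WCHAN_SLOTS; s++)
		if (key[s])
			show[oldest[s]] = 1;
}
```

The patch decides inline while it walks the task list. This approach instead needs every candidate collected before anything is printed.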
>
> 2. Introduce a hung_task_reported bit-field in task_struct. If a task
> remains hung across multiple intervals, khungtaskd recognises it
> has already been reported. The bit is safely cleared without
> locks or atomics the moment the task's context switch counter
> increments.
>
> 3. For duplicate tasks, we still print the single-line
> "INFO: task ..." message and trigger tracepoint
> trace_sched_process_hang(). It merely skips calling
> sched_show_task() and debug_show_blocker(), printing a concise
> suppression notice instead.
Ah, OK. So if we need more information, we can record it in the trace
ring buffer.
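Here is a small userspace model of the reported-bit lifecycle from items 2 and 3. It leaves out the wchan deduplication. struct model_task, model_check() and enum hang_report are hypothetical. The timeout is in abstract ticks rather than timeout * HZ.

```c
#include <assert.h>

enum hang_report { HANG_NONE, HANG_FULL, HANG_SHORT };

/* Hypothetical stand-in for the task_struct fields the patch touches. */
struct model_task {
	unsigned long switch_count;      /* nvcsw + nivcsw */
	unsigned long last_switch_count;
	unsigned long last_switch_time;  /* in ticks */
	unsigned int hung_task_reported:1;
};

/*
 * One khungtaskd visit: mirrors task_is_hung() plus the patch's bit
 * handling. HANG_FULL means the stack would be dumped; HANG_SHORT means
 * only the "INFO: task ..." line and the tracepoint.
 */
static enum hang_report model_check(struct model_task *t, unsigned long now,
				    unsigned long timeout)
{
	enum hang_report r;

	if (t->switch_count != t->last_switch_count) {
		/* The task made progress: restart the clock, forget the report. */
		t->last_switch_count = t->switch_count;
		t->last_switch_time = now;
		t->hung_task_reported = 0;
		return HANG_NONE;
	}
	/* Deadline still in the future (cf. time_is_after_jiffies()). */
	if ((long)(now - (t->last_switch_time + timeout)) < 0)
		return HANG_NONE;

	r = t->hung_task_reported ? HANG_SHORT : HANG_FULL;
	t->hung_task_reported = 1;
	return r;
}
```

A task that stays blocked past the timeout gets one full report, then short reports on later scans. It gets a full report again only after it has switched at least once and then hung again.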
>
> Signed-off-by: Aaron Tomlin <atomlin@xxxxxxxxxxx>
> --
> Changes since v2:
>
> - Replaced the per-round cache flush with a task_struct bit-field for
> persistent cross-scan tracking, mitigating delayed budget exhaustion
>
> - Abandoned exact-stack hashing in favour of Wait Channel hashing
>
> - Transitioned from jhash() to hash_long() to optimise single-pointer
> hashing, and relocated the hash map to the local stack
>
> - Linked to v2: https://lore.kernel.org/lkml/20260620013559.1537893-1-atomlin@xxxxxxxxxxx/
>
> Changes since v1:
>
> - Preserve "INFO:" headers for all hung tasks; suppress only the stack
> dumps for duplicates (Masami Hiramatsu)
>
> - Print a clear notification when a trace is explicitly suppressed
>
> - Add #ifdef CONFIG_STACKTRACE guards to prevent Kconfig build errors
>
> - Optimise overhead by unwinding the stack only if a warning is
> actually going to be printed
>
> - Linked to v1: https://lore.kernel.org/lkml/20260617184841.1447955-1-atomlin@xxxxxxxxxxx/
> ---
> include/linux/sched.h | 3 +++
> kernel/hung_task.c | 32 ++++++++++++++++++++++++++++----
> 2 files changed, 31 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b3204a15d512..e76cf221cc78 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1046,6 +1046,9 @@ struct task_struct {
> /* Used by page_owner=on to detect recursion in page tracking. */
> unsigned in_page_owner:1;
> #endif
> +#ifdef CONFIG_DETECT_HUNG_TASK
> + unsigned hung_task_reported:1;
> +#endif
> #ifdef CONFIG_EVENTFD
> /* Recursion prevention for eventfd_signal() */
> unsigned in_eventfd:1;
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 6fcc94ce4ca9..5dcce0e7041b 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -25,6 +25,7 @@
> #include <linux/hung_task.h>
> #include <linux/rwsem.h>
> #include <linux/sys_info.h>
> +#include <linux/hash.h>
>
> #include <trace/events/sched.h>
>
> @@ -125,6 +126,7 @@ static bool task_is_hung(struct task_struct *t, unsigned long timeout)
> if (switch_count != t->last_switch_count) {
> t->last_switch_count = switch_count;
> t->last_switch_time = jiffies;
> + t->hung_task_reported = 0;
> return false;
> }
> if (time_is_after_jiffies(t->last_switch_time + timeout * HZ))
> @@ -228,12 +230,14 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti
> * @t: Pointer to the detected hung task.
> * @timeout: Timeout threshold for detecting hung tasks
> * @this_round_count: Count of hung tasks detected in the current iteration
> + * @skip_show_task: Indicating if stack trace should be skipped
> *
> * Print structured information about the specified hung task, if warnings
> * are enabled or if the panic batch threshold is exceeded.
> */
> static void hung_task_info(struct task_struct *t, unsigned long timeout,
> - unsigned long this_round_count)
> + unsigned long this_round_count,
> + unsigned int skip_show_task)
> {
> trace_sched_process_hang(t);
>
> @@ -261,8 +265,12 @@ static void hung_task_info(struct task_struct *t, unsigned long timeout,
> pr_err(" Blocked by coredump.\n");
> pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
> " disables this message.\n");
> - sched_show_task(t);
> - debug_show_blocker(t, timeout);
> + if (!skip_show_task) {
> + sched_show_task(t);
> + debug_show_blocker(t, timeout);
> + } else {
> + pr_err(" Stack trace suppressed. Already reported or duplicate wchan\n");
Can we show the wchan hash for each task, so that we can see which
tasks are waiting on the same wchan?
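For example, the slot could be printed in the suppression line. This assumes a 64-bit kernel where hash_long() resolves to the generic hash_64(). The sketch below is a userspace model; model_hash_64(), format_suppressed() and the message text are hypothetical. Two different wchans can land in the same 6-bit slot, so the slot alone is only a hint.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same value as GOLDEN_RATIO_64 in include/linux/hash.h. */
#define MODEL_GOLDEN_RATIO_64 0x61C8864680B583EBull

/* Userspace model of the generic hash_64(): top @bits bits of the product. */
static unsigned int model_hash_64(uint64_t val, unsigned int bits)
{
	return (unsigned int)((val * MODEL_GOLDEN_RATIO_64) >> (64 - bits));
}

/* Hypothetical suppression line that carries the wchan slot. */
static int format_suppressed(char *buf, size_t len, uint64_t wchan)
{
	return snprintf(buf, len, " Stack trace suppressed (wchan slot %u)\n",
			model_hash_64(wchan, 6));
}
```

Printing the wchan itself (e.g. with %ps) would remove the collision ambiguity, at the cost of a longer line.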
Thanks,
> + }
>
> if (!sysctl_hung_task_warnings)
> pr_info("Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings\n");
> @@ -306,6 +314,9 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> unsigned long this_round_count;
> int need_warning = sysctl_hung_task_warnings;
> unsigned long si_mask = hung_task_si_mask;
> + unsigned long wchan, wchan_hash[64] = { 0 };
> + unsigned int hash;
> + unsigned int skip_show_task;
>
> /*
> * If the system crashed already then all bets are off,
> @@ -326,6 +337,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> }
>
> if (task_is_hung(t, timeout)) {
> + skip_show_task = t->hung_task_reported;
> /*
> * Increment the global counter so that userspace could
> * start migrating tasks ASAP. But count the current
> @@ -334,7 +346,19 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> */
> atomic_long_inc(&sysctl_hung_task_detect_count);
> this_round_count++;
> - hung_task_info(t, timeout, this_round_count);
> +
> + wchan = get_wchan(t);
> + if (wchan) {
> + hash = hash_long(wchan, 6);
> + if (wchan_hash[hash] == wchan)
> + skip_show_task = 1;
> + else
> + wchan_hash[hash] = wchan;
> + }
> +
> + hung_task_info(t, timeout, this_round_count,
> + skip_show_task);
> + t->hung_task_reported = 1;
> }
> }
> unlock:
> --
> 2.51.0
>
--
Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>