Re: [PATCH v2] hung_task: Add per-round stack trace deduplication
From: Aaron Tomlin
Date: Sun Jun 21 2026 - 17:32:05 EST

On Sun, Jun 21, 2026 at 01:17:18PM +0800, Lance Yang wrote:
> I think your reply still misses most of my concerns ...
>
> >You raise an entirely fair point regarding maintainability; every new
> >control knob indeed carries a permanent cost for the maintainers, and I
> >respect your caution.
>
> Yeah, that matters, IMHO.
>
> >To answer your question regarding real world pain: the primary issue is not
> >merely visual clutter, but the premature exhaustion of the warning budget
> >and the preservation of the kernel ring buffer during cascading failures.
>
> Right, but that still sounds like a very specific case.
>
> When khungtaskd fires, something is already wrong, no?
>
> Even with one identical stack, per-round dedup only helps inside one
> scan. The same stack can still come back in later rounds and burn through
> hung_task_warnings anyway.
>
> And under heavy contention, I would not expect only one stack anyway.
> Different tasks can hang behind different locks or different callers, and
> those stacks can still burn through the warning budget.
>
> >In our production environments, we typically leave
> >kernel.hung_task_warnings at its default value of 10. If a severe lock
> >contention occurs, a single bottleneck can easily cause 10 tasks to hang
> >simultaneously with the exact same stack trace. Under the current logic,
>
> Not sure I buy this premise :)
>
> Same bottleneck does not necessarily mean exactly the same stack.
> Different callers can block on the same lock, and exact-stack dedup won't
> help there.
>
> At least from cases I've looked at, I can't really recall seeing this
> exact pattern often enough to justify a new khungtaskd knob.
>
> >those 10 identical traces will completely exhaust the warning budget.
> >Consequently, the kernel is left entirely blind to any subsequent or
> >completely unrelated deadlocks that might be occurring concurrently, as all
> >further reports are silenced.
>
> I don't think "entirely blind" is accurate.
>
> hung_task_warnings *only* gates printk. We still bump
> hung_task_detect_count and hit trace_sched_process_hang() before that
> gate.
>
> >Furthermore, dumping a full stack trace for every duplicate rapidly injects
> >several lines of identical noise into dmesg. We have found that this
> >sudden burst frequently rolls the circular ring buffer.
> >
> >Userspace tooling is unfortunately unable to group or analyse logs that
> >have already been evicted before the tool could read them, nor can it
> >recover traces the kernel silently dropped due to an exhausted budget.
> >
> >The deduplicator acts as a telemetry filter, ensuring that the limited
> >warning budget is spent strictly on unique traces rather than redundant
> >noise, thereby preserving the history of the crash and ensuring secondary
> >failures are not obscured.
> >
> >I wanted to clarify the exact operational context and the limitation of
> >relying on userspace. Please let me know if this operational context alters
> >your perspective at all.
> [...]
>
> Aaron, you've done good work in khungtaskd, and some of it is upstream
> already. I do appreciate that!
>
> But this one feels different. Useful locally, maybe, but not something
> the kernel should carry forever.
>
> Anyway, I'll stop here. Still a nack from my side.

Hi Lance,

Understood, and I appreciate you taking the time to outline your concerns
so clearly.

Your point regarding different callers contending on the exact same lock is
particularly salient. You are absolutely right that an exact-stack hash
would fail to deduplicate those instances.

Furthermore, your observation regarding the per-round flushing is spot on.
Even if the stacks were identical, clearing the hash table each iteration
means the system would simply burn through the warning budget on subsequent
scans anyway, ultimately failing to solve the core issue.

To resolve both of these issues while satisfying Masami's requirement to
preserve task-count observability, I have entirely reworked the
architecture for the upcoming v3. The new approach abandons the exact-stack
hash entirely. Instead:

  1. It uses a lightweight "Wait Channel" (wchan) hash to successfully
     group and deduplicate tasks blocked on the exact same bottleneck,
     regardless of their disparate call stacks.

  2. It introduces a hung_task_reported bit-field in task_struct to
     prevent budget exhaustion across subsequent scans.

  3. It still prints the single-line "INFO: task ..." and invokes
     the trace_sched_process_hang tracepoint for every task, merely
     suppressing the call to sched_show_task() for duplicates.
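
For illustration only, the three points above can be modelled in userspace
roughly as follows. This is not the v3 patch and not kernel code: struct
fake_task, scan_round(), wchan_seen_or_insert() and WCHAN_SLOTS are
hypothetical names, the wait channel is reduced to a plain integer, and the
sketch assumes that only a full stack dump consumes the warning budget and
that every task handled in a round gets its hung_task_reported bit set.
Gating of the one-line message by the budget is deliberately not modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WCHAN_SLOTS 64	/* hypothetical table size, not from the patch */

/* Userspace stand-in for the few task_struct fields the model needs. */
struct fake_task {
	const char *comm;
	int pid;
	uintptr_t wchan;		/* address the task is assumed to sleep on */
	bool hung_task_reported;	/* models the proposed per-task bit */
};

/* Counters standing in for printk, the tracepoint and sched_show_task(). */
struct scan_stats {
	int info_lines;
	int tracepoints;
	int full_dumps;
};

static uintptr_t seen[WCHAN_SLOTS];

/*
 * Open-addressing set keyed by wait channel. Returns true if the channel
 * was already recorded in this round, otherwise records it and returns
 * false. An unknown channel (0) or a full table is treated as "new", so
 * the model over-reports rather than hides a trace.
 */
static bool wchan_seen_or_insert(uintptr_t wchan)
{
	size_t start = (size_t)(wchan >> 4) % WCHAN_SLOTS;

	if (wchan == 0)
		return false;

	for (size_t i = 0; i < WCHAN_SLOTS; i++) {
		size_t slot = (start + i) % WCHAN_SLOTS;

		if (seen[slot] == wchan)
			return true;
		if (seen[slot] == 0) {
			seen[slot] = wchan;
			return false;
		}
	}
	return false;
}

/* One scan over tasks that are all assumed to be hung already. */
static void scan_round(struct fake_task *tasks, size_t n, int *warnings,
		       struct scan_stats *st)
{
	memset(seen, 0, sizeof(seen));	/* the wchan set is per round */

	for (size_t i = 0; i < n; i++) {
		struct fake_task *t = &tasks[i];
		bool dup = wchan_seen_or_insert(t->wchan);
		bool already = t->hung_task_reported;

		/* Point 3: per-task observability is kept for every task. */
		st->tracepoints++;
		st->info_lines++;

		/* Point 2: remember the task across later rounds. */
		t->hung_task_reported = true;

		/* Point 1: one full dump per wait channel, budget permitting. */
		if (!dup && !already && *warnings > 0) {
			st->full_dumps++;
			(*warnings)--;
		}
	}
}
```

Under these assumptions, ten tasks sharing one wait channel plus one
unrelated task cost two units of the budget in the first round and none in
later rounds, while the per-task line and tracepoint counts still grow with
every task seen.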

I plan to send out v3 shortly, and I hope you will find it to be a much
cleaner direction.

Kind regards,
--
Aaron Tomlin