Re: [feature] automatically detect hung TASK_UNINTERRUPTIBLE tasks

From: Andi Kleen
Date: Sun Dec 02 2007 - 16:19:37 EST


On Sun, Dec 02, 2007 at 10:10:27PM +0100, Ingo Molnar wrote:
> what if you considered - just for a minute - the possibility of this
> debug tool being the thing that actually animates developers to fix such
> long delay bugs that have bothered users for almost a decade meanwhile?

Throwing frequent debugging messages for non-buggy cases will
just lead people to generally ignore softlockup reports.

I don't think runtime instrumentation is the way to introduce
TASK_KILLABLE in general. The only way there is for people to go through
the source and identify places where it makes sense.

>
> Until now users had little direct recourse to get such problems fixed.
> (we had sysrq-t, but that included no real metric of how long a task was

Actually task delay accounting can measure this now. iirc someone
had a latencytop based on it already.

> blocked, so there was no direct link in the typical case and users had
> no real reliable tool to express their frustration about unreasonable
> delays.)
>
> Now this changes: they get a "smoking gun" backtrace reported by the
> kernel, and blamed on exactly the place that caused that unreasonable
> delay. And it's not like the kernel breaks - at most 10 such messages
> are reported per bootup.
>
> We increase the delay timeout to say 300 seconds, and if the system is
> under extremely high IO load then 120+ might be a reasonable delay, so
> it's all tunable and runtime disable-able anyway. So if you _know_ that
> you will see and tolerate such long delays, you can tweak it - but i can

This means the user has to see their kernel log fill with such
messages at least once - do a round trip to some mailing list to
be told that it is expected and not a kernel bug - then tweak
some obscure parameters. Doesn't seem like a particularly fruitful
procedure to me.

> tell you with 100% certainty that 99.9% of the typical Linux users do
> not characterize such long delays as "correct behavior".

It's about robustness, not the typical case. A system that throws
backtraces whenever something slightly unusual happens is not robust.

-Andi