Re: [PATCH][2.5] hangcheck-timer

From: john stultz (johnstul@us.ibm.com)
Date: Mon Jan 20 2003 - 20:42:16 EST


> Attached is a patch adding hangcheck-timer. It is used to detect
> when the system goes out to lunch for a period of time, and then
> returns. This is interesting for debugging drivers as well as for
> clustering environments.
> The module sets a timer. When the timer goes off, it then uses
> get_cycles() to determine how much real time has passed.

get_cycles() is a poor method for determining "real time".
Please use do_gettimeofday().
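
Something along these lines would give you the elapsed wall time
directly in usecs (untested sketch, the variable names are mine):

	struct timeval tv_start, tv_now;
	long long usec_elapsed;

	/* record the start time when the timer is armed */
	do_gettimeofday(&tv_start);

	/* ... later, in the timer callback ... */
	do_gettimeofday(&tv_now);
	usec_elapsed = (tv_now.tv_sec - tv_start.tv_sec) * 1000000LL
			+ (tv_now.tv_usec - tv_start.tv_usec);

No arch-specific cycle counter needed, and the result is already in
units you can sanity-check by eye.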

> On a normal system, the real elapsed time will be almost
> identical to the expected timer duration. However, if a device decided
> to udelay for 60 seconds (or some other circumstance), the module takes

I believe you mean "udelay for 60 microseconds", since udelay()'s
argument is in microseconds.

> + * The hangcheck-timer driver uses the TSC to catch delays that
> + * jiffies does not notice. A timer is set. When the timer fires, it
> + * checks whether it was delayed and if that delay exceeds a given
> + * margin of error. The hangcheck_tick module parameter takes the timer
> + * duration in seconds. The hangcheck_margin parameter defines the
> + * margin of error, in seconds. The defaults are 60 seconds for the
> + * timer and 180 seconds for the margin of error. IOW, a timer is set
> + * for 60 seconds. When the timer fires, the callback checks the
> + * actual duration that the timer waited. If the duration exceeds the
> + * allotted time and margin (here 60 + 180, or 240 seconds), the machine
> + * is restarted. A healthy machine will have the duration match the
> + * expected timeout very closely.
> + */

Wait, so 180 seconds is the margin of error?

[snip]

> +#define DEFAULT_IOFENCE_MARGIN 60 /* Default fudge factor, in seconds */
> +#define DEFAULT_IOFENCE_TICK 180 /* Default timer timeout, in seconds */
> +

ok, time in seconds...

> +static int hangcheck_tick = DEFAULT_IOFENCE_TICK;
> +static int hangcheck_margin = DEFAULT_IOFENCE_MARGIN;
> +static int hangcheck_reboot = 0; /* Do not reboot */
> +

so hangcheck_margin's unit is seconds...

[snip]
> +static void hangcheck_fire(unsigned long data)
> +{
> + unsigned long long cur_tsc, tsc_diff;
> +
> + cur_tsc = get_cycles();
> +
> + if (cur_tsc > hangcheck_tsc)
> + tsc_diff = cur_tsc - hangcheck_tsc;
> + else
> + tsc_diff = (cur_tsc + (~0ULL - hangcheck_tsc)); /* or something */
> +
[snip]
> +
> + if (tsc_diff > hangcheck_tsc_margin) {

but now we're using it to compare cycles! 180 seconds != 180 cycles
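
If you really wanted to keep the TSC, the margin would at least have to
be converted to cycles at module init, something like this (sketch
only; assumes x86's cpu_khz and a TSC that runs at a constant rate):

	/* seconds -> TSC cycles; cpu_khz is the CPU clock in kHz */
	hangcheck_tsc_margin = (unsigned long long)(hangcheck_tick + hangcheck_margin)
				* cpu_khz * 1000;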

Additionally, this code doesn't take into account systems with unsynced
TSCs, or systems that change CPU frequency. Again, please use
do_gettimeofday(). Then you can talk about the returned values in secs
and usecs, and I believe things will be much clearer.
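
I.e. something along these lines in the callback (completely untested;
hangcheck_tv is a name I made up for a timeval you would save with
do_gettimeofday() each time the timer is armed):

	static struct timeval hangcheck_tv;

	static void hangcheck_fire(unsigned long data)
	{
		struct timeval tv_now;
		long long usec_elapsed, usec_allowed;

		do_gettimeofday(&tv_now);
		usec_elapsed = (tv_now.tv_sec - hangcheck_tv.tv_sec) * 1000000LL
				+ (tv_now.tv_usec - hangcheck_tv.tv_usec);
		usec_allowed = (long long)(hangcheck_tick + hangcheck_margin) * 1000000LL;

		if (usec_elapsed > usec_allowed) {
			printk(KERN_CRIT "Hangcheck: timer was delayed past the margin!\n");
			if (hangcheck_reboot)
				machine_restart(NULL);
		}
		/* re-arm the timer and refresh hangcheck_tv here */
	}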

thanks
-john
