Re: [patch] fix the softlockup watchdog to actually work

From: Andrew Morton
Date: Thu Jul 19 2007 - 03:24:01 EST


On Tue, 17 Jul 2007 17:49:34 +0200 Ingo Molnar <mingo@xxxxxxx> wrote:

> Subject: fix the softlockup watchdog to actually work
> From: Ingo Molnar <mingo@xxxxxxx>
>
> this Xen related commit:
>
> commit 966812dc98e6a7fcdf759cbfa0efab77500a8868
> Author: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
> Date: Tue May 8 00:28:02 2007 -0700
>
> Ignore stolen time in the softlockup watchdog
>
> broke the softlockup watchdog to never report any lockups. (!)
>
> print_timestamp defaults to 0, this makes the following condition
> always true:
>
> if (print_timestamp < (touch_timestamp + 1) ||
>
> and we'll in essence never report soft lockups.
>
> apparently the functionality of the soft lockup watchdog was never
> actually tested with that patch applied ...
>
> [this is -stable material too.]

This seems terribly sensitive.

Someone has broken the Vaio (shock, horror). It now has mysterious
jerkiness: when leaning on autorepeat it stalls for maybe 0.25 seconds
every 1.5 seconds. The stalls are far less than a second. Yet this
is enough to trigger random softlockup warnings.

Some of those warnings are below. Note that the traces are all pretty
useless, as softlockup warnings so often seem to be.

Of course, it could be that whatever is causing these pauses really _is_
stalling for a whole second occasionally, dunno. But I didn't notice any
long stalls in the console output when a particular storm of softlockup
warnings came out.

But I'll sit on this patch for a while until this gets sorted out.
Meanwhile, please double-check the elapsed-time arithmetic in there,
maybe do a bit of runtime testing?



[ 78.820961] BUG: soft lockup detected on CPU#0!
[ 78.821083] [<c0122475>] update_process_times+0x32/0x54
[ 78.821216] [<c012fe7a>] tick_sched_timer+0x61/0x9c
[ 78.821340] [<c012c2e7>] hrtimer_interrupt+0x142/0x1d4
[ 78.821463] [<c012fe19>] tick_sched_timer+0x0/0x9c
[ 78.821587] [<c012f74a>] tick_do_broadcast+0x1f/0x3f
[ 78.821707] [<c012fa01>] tick_handle_oneshot_broadcast+0x47/0x72
[ 78.821852] [<c01067ca>] timer_interrupt+0x1a/0x20
[ 78.821968] [<c014291e>] handle_IRQ_event+0x1a/0x3f
[ 78.822089] [<c0143521>] handle_edge_irq+0x9d/0xcc
[ 78.822206] [<c0105d7b>] do_IRQ+0x53/0x6c
[ 78.822307] [<c012f4f0>] tick_notify+0x15c/0x208
[ 78.822422] [<c01044cf>] common_interrupt+0x23/0x28
[ 78.822539] [<c012f1d4>] clockevents_notify+0x8/0x36
[ 78.822663] [<c020d199>] acpi_processor_idle+0x1d2/0x36d
[ 78.822798] [<c0102345>] cpu_idle+0x44/0x5e
[ 78.822900] [<c03baa8d>] start_kernel+0x26d/0x275
[ 78.823017] [<c03ba3fe>] unknown_bootoption+0x0/0x202
[ 78.823142] =======================
[ 106.282830] BUG: soft lockup detected on CPU#0!
[ 106.282967] [<c0122475>] update_process_times+0x32/0x54
[ 106.283116] [<c012fe7a>] tick_sched_timer+0x61/0x9c
[ 106.283255] [<c012c2e7>] hrtimer_interrupt+0x142/0x1d4
[ 106.283391] [<c012fe19>] tick_sched_timer+0x0/0x9c
[ 106.283530] [<c012f74a>] tick_do_broadcast+0x1f/0x3f
[ 106.283663] [<c012fa01>] tick_handle_oneshot_broadcast+0x47/0x72
[ 106.283821] [<c01067ca>] timer_interrupt+0x1a/0x20
[ 106.283949] [<c014291e>] handle_IRQ_event+0x1a/0x3f
[ 106.284084] [<c0143521>] handle_edge_irq+0x9d/0xcc
[ 106.284215] [<c0105d7b>] do_IRQ+0x53/0x6c
[ 106.284326] [<c012f4f0>] tick_notify+0x15c/0x208
[ 106.284455] [<c01044cf>] common_interrupt+0x23/0x28
[ 106.284587] [<c012f1d4>] clockevents_notify+0x8/0x36
[ 106.284725] [<c020d199>] acpi_processor_idle+0x1d2/0x36d
[ 106.284875] [<c0102345>] cpu_idle+0x44/0x5e
[ 106.284988] [<c03baa8d>] start_kernel+0x26d/0x275
[ 106.285117] [<c03ba3fe>] unknown_bootoption+0x0/0x202
[ 106.285257] =======================
[ 109.266423] BUG: soft lockup detected on CPU#0!
[ 109.266558] [<c0122475>] update_process_times+0x32/0x54
[ 109.266703] [<c012fe7a>] tick_sched_timer+0x61/0x9c
[ 109.270745] [<c012c2e7>] hrtimer_interrupt+0x142/0x1d4
[ 109.274790] [<c012fe19>] tick_sched_timer+0x0/0x9c
[ 109.278865] [<c012f74a>] tick_do_broadcast+0x1f/0x3f
[ 109.282950] [<c012fa01>] tick_handle_oneshot_broadcast+0x47/0x72
[ 109.287026] [<c01067ca>] timer_interrupt+0x1a/0x20
[ 109.291012] [<c014291e>] handle_IRQ_event+0x1a/0x3f
[ 109.294950] [<c0143521>] handle_edge_irq+0x9d/0xcc
[ 109.298864] [<c0105d7b>] do_IRQ+0x53/0x6c
[ 109.302818] [<c012f4f0>] tick_notify+0x15c/0x208
[ 109.306740] [<c01044cf>] common_interrupt+0x23/0x28
[ 109.310641] [<c012f1d4>] clockevents_notify+0x8/0x36
[ 109.314543] [<c020d199>] acpi_processor_idle+0x1d2/0x36d
[ 109.318461] [<c0102345>] cpu_idle+0x44/0x5e
[ 109.322348] [<c03baa8d>] start_kernel+0x26d/0x275
[ 109.326267] [<c03ba3fe>] unknown_bootoption+0x0/0x202
[ 109.330188] =======================

(ah, the Vaio breakage seems to be -mm-only, whew)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/