Re: [patch] timer-irq-driven soft-watchdog, cleanups

From: Bartlomiej Zolnierkiewicz
Date: Fri Feb 17 2006 - 09:44:48 EST


On 2/17/06, Ingo Molnar <mingo@xxxxxxx> wrote:
>
> * Bartlomiej Zolnierkiewicz <bzolnier@xxxxxxxxx> wrote:
>
> > I'm still not 100% sure if it was false positive - it looked like from
> > the trace but I find it hard to believe that users wouldn't complain
> > about 10sec stalls [ Soft lockup detector claims to trigger if after
> > 10sec it hasn't been touched - is it really working as advertised?
> > How can we verify this? ].
>
> the watchdog is quite simple: it consists of per-CPU SCHED_FIFO prio 99
> [i.e. highest RT priority] threads that do nothing but:
>
> 	while (!kthread_should_stop()) {
> 		msleep_interruptible(1000);
> 		touch_softlockup_watchdog();
> 	}
>
> i can think of only one (pretty theoretical) scenario for a false
> positive here: msleep uses timers, which are processed by softirq
> context, which context itself might be delayed. Under extreme load, if
> softirqs get delayed for more than 9 seconds, this _might_ lead to false
> positives. But that i think is highly unlikely in the reported IDE
> cases.

Actually it seems very likely [ if I'm reading the code right ]: the
ide_intr() IRQ handler contains an "optimization": it calls
ide_do_request() for the next request if the previous request has
completed. Processing the next request involves waiting for the device
to become ready (up to 5sec) and sending the first chunk of data out in
the case of the PIO-out protocol. Moreover, if the requests are short we
can hit the "optimization" case a few times in a row. Hard IRQs will
still get their chance to be processed, but softirq context can be
delayed quite a lot, as it is executed after the hard IRQ handler
completes.
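
To make the scenario concrete, here is a toy user-space model of it (not
kernel code, just a sketch). It assumes one-second ticks, a watchdog
thread that can only touch the timestamp on ticks where softirqs get to
run, and a warning once more than 10 seconds have passed since the last
touch. The function name, its parameters and the exact form of the
threshold comparison are my own assumptions for illustration; none of
them are taken from kernel/softlockup.c.

```c
#include <stdbool.h>

/* Assumed threshold: seconds without a touch before a warning fires. */
#define LOCKUP_THRESHOLD 10

/*
 * Hypothetical model, one loop iteration per simulated second.
 * Softirqs (and hence the timer that wakes the watchdog thread) are
 * blocked during [stall_start, stall_start + stall_len).
 * Returns the second at which a warning would fire, or -1 if none.
 */
int first_softlockup_warning(int stall_start, int stall_len, int total_secs)
{
	int last_touch = 0;

	for (int now = 1; now <= total_secs; now++) {
		bool softirq_runs = now < stall_start ||
				    now >= stall_start + stall_len;

		/* Timer expires, the thread wakes up and touches the watchdog. */
		if (softirq_runs)
			last_touch = now;

		/* The check itself is assumed to run on every tick. */
		if (now - last_touch > LOCKUP_THRESHOLD)
			return now;
	}
	return -1;
}
```

In this model first_softlockup_warning(5, 9, 60) returns -1, while
first_softlockup_warning(5, 11, 60) returns 15: the CPU kept servicing
hard IRQs the whole time, yet a warning fires simply because the thread
could not be woken. A few chained requests that each wait several
seconds for the device would be enough to produce such a stall.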

> in any case, the patch below gets rid of the softirq involvement, and
> makes the soft-watchdog purely timer-irq driven (and a few minor
> cleanups). Could you try it? I have tested it - it correctly detected a
> 11-seconds delay and stayed silent during a 9-seconds delay.
>
> If you still get warnings even with this patch applied, then my very
> strong suspicion is that the 10+ seconds delays in the IDE code are
> real, and not false-positives. If there are such places then the minimum
> we should do is to document them via touch_softlockup_watchdog() ...
> even if you "knew" about such places already.

Fully agreed, where is the patch?

Sorry, but I have enough higher-priority issues to take care of, and
I'm not going to spend any more time on soft lockups even if they are
real problems in the IDE subsystem. If this is not fixed before 2.6.16
I'm submitting a patch to Linus making DETECT_SOFTLOCKUP depend
on "CONFIG_IDE=n"... at least users will be able to use their systems
instead of seeing lockups.

DETECT_SOFTLOCKUP should be an aid in development, not a
method of forcing driver maintainers to work on specific issues...

Thank you for understanding.

Bartlomiej