Re: [PATCH v3 2/2] Make hard lockup detection use timestamps

From: Don Zickus
Date: Wed Aug 03 2011 - 15:11:52 EST


On Mon, Aug 01, 2011 at 01:11:27PM -0700, ZAK Magnus wrote:
> On Mon, Aug 1, 2011 at 12:24 PM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
> > One idea I thought of to work around this is to save the timestamp and the
> > watchdog bool and restore them after the stack dump.  It's a cheap hack and I
> > am not too sure about the locking as it might race with
> > touch_nmi_watchdog().  But it gives you an idea what I was thinking.
> Yes, I see. Is the hackiness of it okay?

Hi,

I don't think it is too bad. Most of the state involved is per_cpu and is
intended to stay per_cpu. There might be an occasional case where another
cpu is trying to zero out the watchdog_nmi_touch or watchdog_touch_ts
variables.

I was trying to fix the cross-cpu case for watchdog_nmi_touch to eliminate
that problem, but Ingo wanted me to implement some panic ratelimit first
(which I lost track of). Still, being in NMI context and staying per_cpu
should make that case safe, I believe, despite the hackiness of it.
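
To make the hack concrete, here is a rough userspace model of the
save/restore idea. It is only a sketch, not kernel code: struct
cpu_watchdog, model_touch(), model_dump_stack() and report_stall() are
made-up names, and the model assumes that a touch zeroes the timestamp and
sets the bool.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for one CPU's watchdog state. */
struct cpu_watchdog {
	unsigned long touch_ts;	/* models watchdog_touch_ts */
	bool nmi_touch;		/* models watchdog_nmi_touch */
};

/* Assumed effect of a touch: timestamp zeroed, bool set. */
static void model_touch(struct cpu_watchdog *wd)
{
	wd->touch_ts = 0;
	wd->nmi_touch = true;
}

/* Stand-in for the stack dump; the printing path touches the watchdog. */
static void model_dump_stack(struct cpu_watchdog *wd)
{
	model_touch(wd);
}

/* Save both values, dump, then restore so a later stall is still seen. */
static void report_stall(struct cpu_watchdog *wd)
{
	unsigned long saved_ts = wd->touch_ts;
	bool saved_touch = wd->nmi_touch;

	model_dump_stack(wd);

	wd->touch_ts = saved_ts;
	wd->nmi_touch = saved_touch;
}
```

The model has no concurrency at all, so it says nothing about the locking
question; it only shows the ordering of save, dump and restore.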

The watchdog_touch_ts variable is only written from another cpu in the
touch_all_softlockup_watchdogs() case, which currently only happens when
the scheduler is spewing stats. This should happen rarely. That leaves the
problem of the softlockup check being preempted in interrupt context while
another interrupt handler touches the watchdog. I don't know how to solve
this reliably, but I think it should be ok most of the time. The only
downside is a premature softlockup, I would think.
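
One way to picture the premature softlockup, again as a userspace sketch
under assumptions: a touch is modelled as writing 0 to the timestamp, the
check re-arms when it sees 0, and the numbers are arbitrary.
check_softlockup() and lockup_reported() are hypothetical names, and the
reading that the restore overwrites a legitimate touch is my guess at the
failure mode.

```c
#include <stdbool.h>

/* Model of the periodic check: 0 means "touched, restart the window". */
static bool check_softlockup(unsigned long *touch_ts, unsigned long now,
			     unsigned long thresh)
{
	if (*touch_ts == 0) {
		*touch_ts = now;	/* a touch was seen: re-arm, no report */
		return false;
	}
	return now - *touch_ts > thresh;
}

/* A touch lands during the dump; does the next check still report? */
static bool lockup_reported(bool restore_after_dump)
{
	unsigned long touch_ts = 100;
	unsigned long saved = touch_ts;	/* saved before the stack dump */

	touch_ts = 0;			/* touch from another cpu or handler */

	if (restore_after_dump)
		touch_ts = saved;	/* restore overwrites that touch */

	return check_softlockup(&touch_ts, 200, 60);
}
```

With the restore the touch is lost and the next check reports; without it
the check just re-arms. That matches the "ok most of the time" feeling:
it needs a touch to land in exactly that window.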

I can't think of a better way to work around the problem and still move
forward with your idea of warning on future stalls.

Then again I have been busy here and haven't put enough thought into it.

Cheers,
Don