Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume.

From: Rafael J. Wysocki
Date: Sat May 11 2013 - 20:31:19 EST


On Friday, May 10, 2013 11:13:27 PM Colin Cross wrote:
> On Fri, May 10, 2013 at 2:28 PM, Zoran Markovic
> <zoran.markovic@xxxxxxxxxx> wrote:
> > From: Benoit Goby <benoit@xxxxxxxxxxx>
> >
> > Below is a patch from android kernel that detects a driver suspend/resume
> > lockup and captures dump in the kernel log. Please review and provide
> > comments.
>
> This paragraph should go below the --- line so it doesn't end up in
> the final commit message.
>
> > Rather than hard-lock the kernel, dump the suspend/resume thread stack and
> > BUG() when a driver takes too long to suspend/resume.

And how exactly is that different from hanging the kernel?

> > The timeout is set to 12 seconds to be longer than the usbhid 10
> > second timeout.

Which is kind of arbitrary.

> > Exclude from the watchdog the time spent waiting for children that
> > are resumed asynchronously and time every device, whether or not they
> > resumed synchronously.

What about changing the code to use wait_for_completion_timeout() instead
of wait_for_completion() and actually trying to recover if one of them times
out? [It could try to unregister the device in question and if *that* hangs
indefinitely, *then* commit a panic() or something similar, but if it succeeds,
abort the ongoing suspend or complete the resume without that device.]

Of course, that would involve changing things not to depend on async, but
might be better than slapping a timer on top.

> > This patch is targeted for mobile devices where a suspend/resume lockup
> > could cause a system reboot and catch user's attention. Information
> > about failing device can later be retrieved from captured log in
> > subsequent boot session.
>
> I would take out the phrase "catch user's attention", the intention of
> this patch is actually the opposite - get the system back to working
> normally as fast as possible, while still putting enough information
> to debug the problem into the log.
>
> > The hardware watchdog timer is likely suspended during this time and
> > couldn't be relied upon. The soft-lockup detector would eventually tell
> > that tasks are not scheduled, but would provide little context as to why.
> > The patch hence uses system timer and assumes it is still active while the
> > devices are suspended/resumed.

I must say I'm not particularly impressed by this patch series. I understand
the motivation, but I'm wondering if that's the best we can do.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/