Re: [PATCH] panic: release stale console lock to always get the logbuf printed out
From: Andrew Morton
Date: Wed Oct 07 2015 - 18:34:15 EST
(cc Jan)
On Wed, 7 Oct 2015 19:02:22 +0200 Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> wrote:
> In some cases we may end up killing the CPU holding the console lock
> while still having valuable data in logbuf. E.g. I'm observing the
> following:
> - A crash is happening on one CPU and console_unlock() is being called on
> some other.
> - console_unlock() tries to print out the buffer before releasing the lock
> and on slow console it takes time.
> - in the meanwhile crashing CPU does lots of printk()-s with valuable data
> (which go to the logbuf) and sends IPIs to all other CPUs.
> - console_unlock() finishes printing previous chunk and enables interrupts
> before trying to print out the rest, the CPU catches the IPI and never
> releases console lock.
Why doesn't the lock-owning CPU release the console lock? Because it
was stopped by smp_send_stop() in panic()?
I don't recall why we stop CPUs in panic(), and of course we didn't
document the reason. I guess it makes sense from the "what else can we
do" point of view, but I wonder if we can just do it later on - that
would fix this problem?
(dumb aside: why doesn't smp_send_stop() stop the calling CPU?)
> This is not the only possible case: in VT/fb subsystems we have many other
> console_lock()/console_unlock() users. Non-masked interrupts (or receiving
> NMI in case of extreme slowness) will have the same result. Getting the
> whole console buffer printed out on crash should be top priority.
Yes, this is a pretty big hole in the logic.
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -23,6 +23,7 @@
> #include <linux/sysrq.h>
> #include <linux/init.h>
> #include <linux/nmi.h>
> +#include <linux/console.h>
>
> #define PANIC_TIMER_STEP 100
> #define PANIC_BLINK_SPD 18
> @@ -147,6 +148,15 @@ void panic(const char *fmt, ...)
>
> bust_spinlocks(0);
>
> + /*
> + * We may have ended up killing the CPU holding the lock and still have
> + * some valuable data in console buffer. Try to acquire the lock and
> + * release it regardless of the result. The release will also print the
> + * buffers out.
> + */
> + console_trylock();
> + console_unlock();
> +
"killing the CPU" is a bit vague. How's this look?
--- a/kernel/panic.c~panic-release-stale-console-lock-to-always-get-the-logbuf-printed-out-fix
+++ a/kernel/panic.c
@@ -149,10 +149,10 @@ void panic(const char *fmt, ...)
bust_spinlocks(0);
/*
- * We may have ended up killing the CPU holding the lock and still have
- * some valuable data in console buffer. Try to acquire the lock and
- * release it regardless of the result. The release will also print the
- * buffers out.
+ * We may have ended up stopping the CPU holding the lock (in
+ * smp_send_stop()) while still having some valuable data in the console
+ * buffer. Try to acquire the lock then release it regardless of the
+ * result. The release will also print the buffers out.
*/
console_trylock();
console_unlock();
_
Does the console_trylock() guarantee that the console lock is now held?
If the console_lock-holding CPU is still running then there's a window
where the above code could enter console_unlock() when nobody's holding
console_lock. If smp_send_stop() always works (synchronously) then
that won't happen.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/