[PATCH v2] kdump: Fix crash_kexec - smp_send_stop race in panic

From: Michael Holzheu
Date: Mon Oct 31 2011 - 08:34:24 EST


Hello Andrew, hello linux-arch,

On Mon, 2011-10-31 at 03:39 -0700, Andrew Morton wrote:
> On Mon, 31 Oct 2011 10:57:16 +0100 Michael Holzheu <holzheu@xxxxxxxxxxxxxxxxxx> wrote:
>
> > > Should this be done earlier in the function? As it stands we'll have
> > > multiple CPUs scribbling on buf[] at the same time and all trying to
> > > print the same thing at the same time, dumping their stacks, etc.
> > > Perhaps it would be better to single-thread all that stuff
> >
> > My first patch took the spinlock at the beginning of panic(). But then
> > Eric asked whether it wouldn't be better to get both panic printk's, and
> > I agreed.
>
> Hm, why? It will make a big mess.

@Andrew:

I thought it would be good to have both messages, and to change the
panic behavior as little as possible...

But OK, I have no problem with taking the lock at the beginning of
panic(). The updated patch is attached below.

> > > Also... this patch affects all CPU architectures, all configs, etc.
> > > So we're expecting that every architecture's smp_send_stop() is able to
> > > stop a CPU which is spinning in spin_lock(), possibly with local
> > > interrupts disabled. Will this work?
> >
> > At least on s390 it will work. If there are architectures that can't
> > stop disabled CPUs then this problem is already there without this
> > patch.
> >
> > Example:
> >
> > 1. 1st CPU gets lock X and panics
> > 2. 2nd CPU is disabled and gets lock X
>
> (irq-disabled)
>
> > 3. 1st CPU calls smp_send_stop()
> > -> 2nd CPU loops disabled and can't be stopped
>
> Well OK. Maybe some architectures do have this problem - who would
> notice? If that is the case, we just made the failure cases much more
> common. Could you check, please?

@linux-arch:

This patch introduces a spinlock to prevent parallel execution of the
panic code. Andrew pointed out that this might be a problem for
architectures that can't do smp_send_stop() on remote CPUs that have
interrupts disabled. When irq-disabled CPUs execute panic() in parallel,
the CPUs that lose the race would then spin on the lock and could not be
stopped.

So please speak up if somebody has a problem with this patch!

Michael
---
From: Michael Holzheu <holzheu@xxxxxxxxxxxxxxxxxx>
Subject: kdump: fix crash_kexec()/smp_send_stop() race in panic

When two CPUs call panic() at the same time, a race condition can
prevent kdump from running. The first CPU calls crash_kexec(), and the
second CPU calls smp_send_stop() in panic() before crash_kexec() has
finished on the first CPU. The second CPU thereby stops the first CPU,
and kdump fails:

1st CPU:
panic()->crash_kexec()->mutex_trylock(&kexec_mutex)-> do kdump

2nd CPU:
panic()->crash_kexec()->kexec_mutex already held by 1st CPU
->smp_send_stop()-> stop 1st CPU (stop kdump)

This patch fixes the problem by introducing a spinlock in panic that
allows only one CPU to process crash_kexec() and the subsequent panic
code.

Signed-off-by: Michael Holzheu <holzheu@xxxxxxxxxxxxxxxxxx>
---
kernel/panic.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
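
The race described in the changelog above can be replayed in user space.
The following is only a minimal single-threaded sketch, not kernel code:
struct panic_state, panic_model() and kdump_survives() are hypothetical
names made up for illustration, plain bools stand in for kexec_mutex and
panic_lock, and "spinning" is modelled as a return value rather than a
real busy-wait.

```c
#include <assert.h>
#include <stdbool.h>

/* What a CPU ends up doing after it enters the modelled panic(). */
enum cpu_outcome { CPU_RUNS_KDUMP, CPU_SENDS_STOP, CPU_SPINS };

/* Hypothetical stand-ins for the kernel state involved in the race. */
struct panic_state {
	bool kexec_mutex_held;	/* models kexec_mutex */
	bool panic_lock_held;	/* models the panic_lock added by the patch */
	bool use_panic_lock;	/* false: old behavior, true: patched */
};

/* One CPU entering panic(); calls are replayed strictly in order. */
static enum cpu_outcome panic_model(struct panic_state *s)
{
	if (s->use_panic_lock) {
		if (s->panic_lock_held)
			return CPU_SPINS;	/* would block in spin_lock_irq() */
		s->panic_lock_held = true;
	}

	if (!s->kexec_mutex_held) {
		s->kexec_mutex_held = true;	/* mutex_trylock() succeeded */
		return CPU_RUNS_KDUMP;
	}

	/* trylock failed: fall through to smp_send_stop() */
	return CPU_SENDS_STOP;
}

/*
 * Replay "1st CPU panics, then 2nd CPU panics while kdump is still
 * running". Returns true if the 2nd CPU never reaches smp_send_stop().
 */
static bool kdump_survives(bool use_panic_lock)
{
	struct panic_state s = { .use_panic_lock = use_panic_lock };
	enum cpu_outcome first = panic_model(&s);
	enum cpu_outcome second = panic_model(&s);

	assert(first == CPU_RUNS_KDUMP);
	return second != CPU_SENDS_STOP;
}
```

In this model kdump_survives(false) reports the failure from the
changelog, while kdump_survives(true) leaves the second CPU spinning.
The model says nothing about whether a given architecture can actually
stop such a spinning, irq-disabled CPU.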

--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -59,6 +59,7 @@ EXPORT_SYMBOL(panic_blink);
  */
 NORET_TYPE void panic(const char * fmt, ...)
 {
+	static DEFINE_SPINLOCK(panic_lock);
 	static char buf[1024];
 	va_list args;
 	long i, i_next = 0;
@@ -68,8 +69,12 @@ NORET_TYPE void panic(const char * fmt,
 	 * It's possible to come here directly from a panic-assertion and
 	 * not have preempt disabled. Some functions called from here want
 	 * preempt to be disabled. No point enabling it later though...
+	 *
+	 * Only one CPU is allowed to execute the panic code from here. For
+	 * multiple parallel invocations of panic all other CPUs will wait on
+	 * the panic_lock. They are stopped afterwards by smp_send_stop().
 	 */
-	preempt_disable();
+	spin_lock_irq(&panic_lock);
 
 	console_verbose();
 	bust_spinlocks(1);

