Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash
From: Leonardo Bras
Date: Thu Apr 02 2020 - 20:40:17 EST
On Thu, 2020-04-02 at 22:28 +1100, Michael Ellerman wrote:
> Leonardo Bras <leonardo@xxxxxxxxxxxxx> writes:
> > During a crash, there is chance that the cpus that handle the NMI IPI
> > are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
> > will cause a deadlock. (rtas.lock and printk logbuf_lock as of today)
> >
> > This is a problem if the system has kdump set up, given if it crashes
> > for any reason kdump may not be saved for crash analysis.
> >
> > After NMI IPI is sent to all other cpus, force unlock all spinlocks
> > needed for finishing crash routine.
>
> I'm not convinced this is the right approach.
Me neither. I think it's a very hacky solution, but I couldn't think of
anything better at the time.
> Busting locks is risky, it could easily cause a crash if data structures
> are left in some inconsistent state.
>
> I think we need to make this code more careful about what it's doing.
> There's a clue at the top of default_machine_crash_shutdown(), which
> calls crash_kexec_prepare_cpus():
>
> * This function is only called after the system
> * has panicked or is otherwise in a critical state.
> * The minimum amount of code to allow a kexec'd kernel
> * to run successfully needs to happen here.
>
>
> You said the "IPI complete" message was the cause of one lockup:
>
> #0 arch_spin_lock
> #1 do_raw_spin_lock
> #2 __raw_spin_lock
> #3 _raw_spin_lock
> #4 vprintk_emit
> #5 vprintk_func
> #7 crash_kexec_prepare_cpus
> #8 default_machine_crash_shutdown
> #9 machine_crash_shutdown
> #10 __crash_kexec
> #11 crash_kexec
> #12 oops_end
>
> TBH I think we could just drop that printk() entirely.
>
> Or we could tell printk() that we're in NMI context so that it uses the
> percpu buffers.
>
> We should probably do the latter anyway, in case there's any other code
> we call that inadvertently calls printk().
>
I was not aware of using per-cpu buffers in printk. It may be a nice
solution.
There is another printk call there:
printk("kexec: Starting switchover sequence.\n");
in default_machine_kexec().
Two printk and one rtas call: it's all I could see using a spinlock
after IPI Complete.
>
> The RTAS trace you sent was:
>
> #0 arch_spin_lock
> #1 lock_rtas ()
> #2 rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
> #3 ics_rtas_mask_real_irq (hw_irq=4100)
> #4 machine_kexec_mask_interrupts
> #5 default_machine_crash_shutdown
> #6 machine_crash_shutdown
> #7 __crash_kexec
> #8 crash_kexec
> #9 oops_end
>
>
> Which doesn't make it clear who holds the RTAS lock. We really shouldn't
> be crashing while holding the RTAS lock, but I guess it could happen.
> Can you get a full backtrace?
>
Oh, all traces are from the thread that called the crash, by writing
'c' to sysrq. That is what I am using to reproduce.
#10 bad_page_fault
#11 handle_page_fault
#12 __handle_sysrq (key=99, check_mask=false)
#13 write_sysrq_trigger
#14 proc_reg_write
#15 __vfs_write
#16 vfs_write
#17 SYSC_write
#18 SyS_write
#19 system_call
>
> PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
> except for a very small list of RTAS calls. So if we bust the RTAS lock
> there's a risk we violate that part of PAPR and crash even harder.
Interesting, I was not aware.
>
> Also it's not specific to kdump, we can't even get through a normal
> reboot if we crash with the RTAS lock held.
>
> Anyway here's a patch with some ideas. That allows me to get from a
> crash with the RTAS lock held through kdump into the 2nd kernel. But it
> only works if it's the crashing CPU that holds the RTAS lock.
>
Nice idea.
But my test environment is just triggering a crash from sysrq, so I
think it would not improve the result, given that this thread is
probably not holding the lock by the time.
I noticed that when rtas is locked, irqs and preemption are also
disabled.
Should the IPI send by crash be able to interrupt a thread with
disabled irqs?
Best regards,
Leonardo Bras
> cheers
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index c5fa251b8950..44ce74966d60 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -25,6 +25,7 @@
> #include <linux/reboot.h>
> #include <linux/syscalls.h>
>
> +#include <asm/debugfs.h>
> #include <asm/prom.h>
> #include <asm/rtas.h>
> #include <asm/hvcall.h>
> @@ -65,6 +66,8 @@ unsigned long rtas_rmo_buf;
> void (*rtas_flash_term_hook)(int);
> EXPORT_SYMBOL(rtas_flash_term_hook);
>
> +static int rtas_lock_holder = -1;
> +
> /* RTAS use home made raw locking instead of spin_lock_irqsave
> * because those can be called from within really nasty contexts
> * such as having the timebase stopped which would lockup with
> @@ -76,7 +79,20 @@ static unsigned long lock_rtas(void)
>
> local_irq_save(flags);
> preempt_disable();
> - arch_spin_lock(&rtas.lock);
> +
> + if (!arch_spin_trylock(&rtas.lock)) {
> + // Couldn't get the lock, do we already hold it?
> + if (rtas_lock_holder == smp_processor_id())
> + // Yes, so we would have deadlocked on ourself. Assume
> + // we're crashing and continue on hopefully ...
> + return flags;
> +
> + // No, wait on the lock
> + arch_spin_lock(&rtas.lock);
> + }
> +
> + rtas_lock_holder = smp_processor_id();
> +
> return flags;
> }
>
> @@ -85,6 +101,8 @@ static void unlock_rtas(unsigned long flags)
> arch_spin_unlock(&rtas.lock);
> local_irq_restore(flags);
> preempt_enable();
> +
> + rtas_lock_holder = -1;
> }
>
> /*
> @@ -1263,3 +1281,24 @@ void rtas_take_timebase(void)
> timebase = 0;
> arch_spin_unlock(&timebase_lock);
> }
> +
> +static int rtas_crash_set(void *data, u64 val)
> +{
> + printk("%s: Taking RTAS lock and then crashing ...\n", __func__);
> + lock_rtas();
> +
> + *((volatile int *) 0) = 0;
> +
> + return 0;
> +}
> +
> +DEFINE_DEBUGFS_ATTRIBUTE(fops_rtas_crash, NULL, rtas_crash_set, "%llu\n");
> +
> +static __init int rtas_crash_debugfs_init(void)
> +{
> + debugfs_create_file_unsafe("crash_in_rtas", 0200,
> + powerpc_debugfs_root, NULL,
> + &fops_rtas_crash);
> + return 0;
> +}
> +device_initcall(rtas_crash_debugfs_init);
> diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
> index d488311efab1..4c52cb58e889 100644
> --- a/arch/powerpc/kexec/crash.c
> +++ b/arch/powerpc/kexec/crash.c
> @@ -15,6 +15,7 @@
> #include <linux/crash_dump.h>
> #include <linux/delay.h>
> #include <linux/irq.h>
> +#include <linux/printk.h>
> #include <linux/types.h>
>
> #include <asm/processor.h>
> @@ -311,6 +312,8 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
> unsigned int i;
> int (*old_handler)(struct pt_regs *regs);
>
> + printk_nmi_enter();
> +
> /*
> * This function is only called after the system
> * has panicked or is otherwise in a critical state.
>
>
Attachment:
signature.asc
Description: This is a digitally signed message part