RE: Re: [V3 PATCH 1/2] x86/panic: Replace smp_send_stop() with kdump friendly version

From: 河合英宏 / KAWAI, HIDEHIRO
Date: Fri Jul 15 2016 - 07:52:10 EST


Hi Dave,

Thanks for your reply.

> From: 'Dave Young' [mailto:dyoung@xxxxxxxxxx]
> Sent: Wednesday, July 13, 2016 11:04 AM
>
> On 07/12/16 at 02:49am, 河合英宏 / KAWAI, HIDEHIRO wrote:
> > Hi Dave,
> >
> > Thanks for the comments.
> >
> > > From: Dave Young [mailto:dyoung@xxxxxxxxxx]
> > > Sent: Monday, July 11, 2016 5:35 PM
> > >
> > > On 07/05/16 at 08:33pm, Hidehiro Kawai wrote:
> > > > This patch fixes one of the problems reported by Daniel Walker
> > > > (https://lkml.org/lkml/2015/6/24/44).
> > > >
> > > > If crash_kexec_post_notifiers boot option is specified, other CPUs
> > > > are stopped by smp_send_stop() instead of machine_crash_shutdown()
> > > > in crash_kexec() path. This behavior change leads to two problems.
> > > >
> > > > Problem 1:
> > > > octeon_generic_shutdown() for MIPS OCTEON assumes that other CPUs
> > > > are still online and tries to stop their watchdog timers. If
> > > > smp_send_stop() is called before octeon_generic_shutdown(),
> > > > stopping the watchdog timers will fail because other CPUs have
> > > > been offlined by smp_send_stop().
> > > >
> > > > panic()
> > > >   if crash_kexec_post_notifiers == 1
> > > >     smp_send_stop()
> > > >     atomic_notifier_call_chain()
> > > >     kmsg_dump()
> > > >     crash_kexec()
> > > >       machine_crash_shutdown()
> > > >         octeon_generic_shutdown() // shutdown watchdog for ONLINE CPUs
> > > >
> > > > Problem 2:
> > > > Most architectures stop other CPUs in the machine_crash_shutdown()
> > > > path, and they also do things needed for kdump there. For example,
> > > > they save registers, disable virtualization extensions, and so on.
> > > > However, if smp_send_stop() stops other CPUs before
> > > > machine_crash_shutdown(), we miss those operations.
> > > >
> > > > How do we fix these problems? In the first place, we should stop
> > > > other CPUs as soon as possible when panic() is called; otherwise
> > > > other CPUs may wipe out clues to the cause of the failure. So,
> > > > we replace smp_send_stop() with one more suitable for kdump.
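
To make Problem 2 above concrete: on x86, the NMI shootdown invoked from
native_machine_crash_shutdown() runs a per-CPU callback which records each
CPU's state before halting it. A simplified sketch (not the exact upstream
code; details are omitted and rearranged):

    static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
    {
            int cpu = raw_smp_processor_id();

            crash_save_cpu(regs, cpu);   /* save this CPU's registers for the vmcore */
            cpu_emergency_vmxoff();      /* leave VMX operation before kexec */
            disable_local_APIC();

            for (;;)
                    halt();              /* park this CPU */
    }

If smp_send_stop() has already taken the other CPUs down, this callback never
runs on them, so their registers are missing from the dump.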
> > >
> > > We have been avoiding extra things in the panic path, but unfortunately
> > > crash_kexec_post_notifiers was added. I tend to agree the best
> > > place for this stuff is in the 2nd kernel or purgatory instead of in the 1st kernel.
> >
> > Several months ago, I posted a patch set which writes regs to SEL,
> > generates an event to send an SNMP message, and starts/stops BMC's watchdog
> > timer in purgatory. This feature requires a BMC with a KCS (Keyboard
> > Controller Style) I/F, but most enterprise-grade servers would have one.
> > (http://thread.gmane.org/gmane.linux.kernel.kexec/15382)
> >
> > Doing kmsg_dump things in purgatory wouldn't be suitable (should be
> > done in the 2nd kernel before enabling devices and IRQs?)
>
> In theory it is doable; maybe do something like oldmem_kmsg_dump while /proc/vmcore is initializing?

Hmm, I checked the case of using ACPI ERST as persistent
storage. The feature is initialized by device_initcall():

device_initcall
  erst_init
    pstore_register

And vmcore is initialized by fs_initcall(), which is called
before device_initcall(). We may be able to change the sequence,
but either way these run in a later phase of kernel initialization.
So, although it would be doable, it may be less reliable.
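
For reference, the ordering follows from the initcall levels; a rough sketch
(not the literal kernel source):

    /* fs_initcall() registers at level 5 and device_initcall() at level 6,
     * so /proc/vmcore is created before ERST/pstore becomes available.   */
    static int __init vmcore_init(void)  { ... }   /* fs/proc/vmcore.c */
    fs_initcall(vmcore_init);

    static int __init erst_init(void)    { ... }   /* ends with pstore_register() */
    device_initcall(erst_init);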

> > > As for this patch I'm not sure it is safe to replace the
> > > smp_send_stop with the kdump friendly function. I'm also not sure if
> > > the kdump friendly function is safe for kdump. Will be glad to hear
> > > opinions from other arch experts.
> >
> > This stuff depends on architectures, so I speak only about
> > x86 (the logic doesn't change on other architectures at this time).
> >
> > kdump path with crash_kexec_post_notifiers disabled:
> >   panic()
> >     __crash_kexec()
> >       crash_setup_regs()
> >       crash_save_vmcoreinfo()
> >       machine_crash_shutdown()
> >         native_machine_crash_shutdown()
> >           panic_smp_send_stop()  /* mostly same as original
> >                                   * kdump_nmi_shootdown_cpus() */
> >
> > kdump path with crash_kexec_post_notifiers enabled:
> >   panic()
> >     panic_smp_send_stop()
> >     __crash_kexec()
> >       crash_setup_regs()
> >       crash_save_vmcoreinfo()
> >       machine_crash_shutdown()
> >         native_machine_crash_shutdown()
> >           panic_smp_send_stop()  // do nothing
> >
> > The difference is whether other CPUs are stopped before crash_setup_regs()
> > and crash_save_vmcoreinfo() or not. Since crash_setup_regs() and
> > crash_save_vmcoreinfo() just save information to some memory area,
> > they wouldn't be affected by panic_smp_send_stop(). This means
> > placing panic_smp_send_stop() before __crash_kexec() is safe.
> >
> > BTW, I noticed my patch breaks the Xen kernel. I'll fix it in the next
> > version.
>
> But it does break stuff which depends on CPUs not being disabled, like problem 1 you mentioned in the patch log.

As I mentioned in the description of this patch, we should stop
other CPUs ASAP to preserve the current state, whether
crash_kexec_post_notifiers is enabled or not.
Then, all remaining procedures should work well
after stopping other CPUs (while keeping the CPU map online).

Vivek also mentioned similar things:
https://lkml.org/lkml/2015/7/14/433
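
To illustrate the intent (a rough sketch only; the cpus_stopped flag is
illustrative and not necessarily how the final patch will look): panic()
stops the other CPUs once, early, and the later call from
native_machine_crash_shutdown() simply becomes a no-op:

    void panic_smp_send_stop(void)
    {
            static int cpus_stopped;

            if (cpus_stopped)
                    return;      /* second call from machine_crash_shutdown()
                                  * does nothing */

            /* NMI-based shootdown: each CPU saves its registers and halts,
             * but the CPU map is left online, so later code that iterates
             * over online CPUs still sees them. */
            kdump_nmi_shootdown_cpus();
            cpus_stopped = 1;
    }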

> > > BTW, if one wants to use crash_kexec_post_notifiers he should take
> > > the risk of unreliable kdump. How about only calling smp_send_stop in
> > > case crash_kexec_post_notifiers is not being used?
> >
> > Unlike panic_smp_send_stop()/kdump_nmi_shootdown_cpus(),
> > smp_send_stop() for x86 tries to stop other CPUs with a normal IPI before issuing an NMI IPI.
> > This would be because an NMI IPI has a risk of deadlock. We checked if
> > the kdump path has a risk of deadlock in the case of NMI panic and
> > fixed it, but I'm not sure about the normal panic path. I agree with
> > using smp_send_stop() if crash_kexec_post_notifiers or kdump is disabled.
>
> What I mean is like below; problem 1 will not exist in this way, but kdump will be unreliable:
>
>     if (!crash_kexec_post_notifiers)
>             smp_send_stop()

The remaining procedures, including atomic_notifier_call_chain and
kmsg_dump, may assume that other CPUs have stopped. Actually,
the IPMI driver callback assumes so. Also, other CPUs may
change the current state while these callbacks are running, making
the dump analysis difficult.

[snip]

Best regards,

Hidehiro Kawai