RE: [V5 PATCH 2/4] panic/x86: Allow cpus to save registers even if they are looping in NMI context

From: æåèå / KAWAIïHIDEHIRO
Date: Wed Nov 25 2015 - 04:46:51 EST


> On Wed, Nov 25, 2015 at 05:51:59AM +0000, æåèå / KAWAIïHIDEHIRO wrote:
> > > > `Infinite loop in NMI context' can happen:
> > > >
> > > > a. when a cpu panics on NMI while another cpu is processing panic
> > > > b. when a cpu received an external or unknown NMI while another
> > > > cpu is processing panic on NMI
> > > >
> > > > In the case of a, it loops in panic_smp_self_stop(). In the case
> > > > of b, it loops in raw_spin_lock() of nmi_reason_lock.
> > >
> > > Please describe those two cases more verbosely - it takes slow people
> > > like me a while to figure out what exactly can happen.
> >
> > a. when a cpu panics on NMI while another cpu is processing panic
> > Ex.
> > CPU 0 CPU 1
> > ================= =================
> > panic()
> > panic_cpu <-- 0
> > check panic_cpu
> > crash_kexec()
> > receive an unknown NMI
> > unknown_nmi_error()
> > nmi_panic()
> > panic()
> > check panic_cpu
> > panic_smp_self_stop()
> > infinite loop in NMI context
> >
> > b. when a cpu received an external or unknown NMI while another
> > cpu is processing panic on NMI
> > Ex.
> > CPU 0 CPU 1
> > ====================== ==================
> > receive an unknown NMI
> > unknown_nmi_error()
> > nmi_panic() receive an unknown NMI
> > panic_cpu <-- 0 unknown_nmi_error()
> > panic() nmi_panic()
> > check panic_cpu panic
> > crash_kexec() check panic_cpu
> > panic_smp_self_stop()
> > infinite loop in NMI context
>
> Ok, that's what I saw too, thanks for confirming.
>
> But please write those examples with the *old* code in the commit
> message, i.e. without panic_cpu and nmi_panic() which you're adding.

Does *old* code mean the code without this patch *series* ?
panic_cpu and nmi_panic() are introduced by PATCH 1/4, not this patch.

> Basically, you want to structure your commit message this way:
>
> This is the problem the current code has: ...
>
> But we need to do this: ...
>
> We fix it by doing that: ...

Good suggestion! I'll revise a bit with following your comment.

> > > > + * directly. This function is used when we have already been in NMI handler.
> > > > + */
> > > > +void poll_crash_ipi_and_callback(struct pt_regs *regs)
> > >
> > > Why "poll"? We won't return from crash_nmi_callback() if we're not the
> > > crashing CPU.
> >
> > This function polls that crash IPI has been issued by checking
> > crash_ipi_done, then invokes the callback. This is different
> > from so-called "poll" functions. Do you have some good name?
>
> Maybe something as simple as "run_crash_callback"?

I prefer this, but we might want to add some more prefix or suffix.
For example, "conditionally_run_crash_nmi_callback".

> Or since we're calling it from other places, maybe add the "crash"
> prefix:
>
> while (!raw_spin_trylock(&nmi_reason_lock))
> crash_run_callback(regs);
>
> Looks even better to me in code context. :)

Thanks for your deep review!

--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group


N‹§²æ¸›yú²X¬¶ÇvØ–)Þ{.nlj·¥Š{±‘êX§¶›¡Ü}©ž²ÆzÚj:+v‰¨¾«‘êZ+€Êzf£¢·hšˆ§~†­†Ûÿû®w¥¢¸?™¨è&¢)ßf”ùy§m…á«a¶Úÿ 0¶ìå