Re: [RFC PATCH] Bug during kexec...not all cpus are stopped

From: Eric W. Biederman
Date: Mon Oct 11 2010 - 17:17:39 EST


Alok Kataria <akataria@xxxxxxxxxx> writes:

> Hi Eric,
>
> Thanks for taking a look.
>
> On Mon, 2010-10-11 at 11:07 -0700, Eric W. Biederman wrote:
>> Alok Kataria <akataria@xxxxxxxxxx> writes:
>>
>> > Copying a few more maintainers on the thread...
>> > On Fri, 2010-10-08 at 13:34 -0700, Alok Kataria wrote:
>> >> Before starting the new kernel kexec calls machine_shutdown to stop all
>> >> the cpus, which internally calls native_smp_send_stop. AFAIU, kexec
>> >> expects that all the cpus are now halted after that call returns.
>> >> Now, looking at the code for native_smp_send_stop, it assumes that all
>> >> the processors have processed the REBOOT ipi in 1 second after the IPI
>> >> was sent.
>> >> native_smp_send_stop()
>> >> ---------------------------------------------------------
>> >> apic->send_IPI_allbutself(REBOOT_VECTOR);
>> >>
>> >> /* Don't wait longer than a second */
>> >> wait = USEC_PER_SEC;
>> >> while (num_online_cpus() > 1 && wait--)
>> >> udelay(1);
>> >> ---------------------------------------------------------
>> >>
>> >> It just returns after that 1 second irrespective of whether all cpus
>> >> were halted or not. This brings up a issue in the kexec case, since we
>> >> can have the BSP starting the new kernel and AP's still processing the
>> >> REBOOT IPI simultaneously.
>> >>
>> >> Many distribution kernels use kexec to load the newly installed kernel
>> >> during the installation phase, in virtualized environment with the host
>> >> heavily overcommitted, we have seen some instances when vcpu fails to
>> >> process the IPI in the allotted 1 sec and as a result the AP's end up
>> >> accessing uninitialized state (the BSP has already gone ahead with
>> >> setting up the new state) and causing GPF's.
>> >>
>> >> IMO, kexec expects machine_shutdown to return only after all cpus are
>> >> stopped.
>> >>
>> >> The patch below should fix the issue, comments ??
>> >>
>> >> --
>> >> machine_shutdown now takes a parameter "wait", if it is true, it waits
>> >> until all the cpus are halted. All the callers except kexec still
>> >> fallback to the earlier version of the shutdown call, where it just
>> >> waited for max 1 sec before returning.
>>
>> The approach seems reasonable. However on every path except for the
>> panic path I would wait for all of the cpus to stop, as that is what
>> those code paths are expecting. Until last year smp_send_stop always
>> waited until all of the cpus stopped. So I think changing
>> machine_shutdown should not need to happen.
>>
>> This patch cannot be used as is because it changes the parameters to
>> smp_send_stop() and there are more than just x86 implementations out
>> there.
>>
>> Let me propose a slightly different tactic. We need to separate
>> the panic path from the normal path to avoid confusion. At the
>> generic level smp_send_stop is exclusively used for the panic
>> path. So we should keep that, and rename the x86 non-panic path
>> function something else.
>>
>> Can you rename the x86 smp_send_stop function say stop_all_other_cpus().
>>
>> At which point we could implement smp_send_stop as:
>> stop_all_other_cpus(0);
>>
>> And everywhere else would call stop_all_other_cpus(1) waiting for
>> the cpus to actually stop.
>
> Done, I have added a stop_other_cpus function which calls
> smp_ops.stop_other_cpus(1)
>
>>
>> I really think we want to wait for the cpus to stop in all cases except
>> for panic/kdump where we expect things to be broken and we are doing
>> our best to make things work anyway.
> Now it does, except in the panic case.
>
> Jeremy, I have modified the xen_reboot bits too so that it now waits
> until all cpus are actually stopped, please take a look.
>
> --
>
> x86 smp_ops now has a new op, stop_other_cpus which takes a parameter "wait"
> this allows the caller to specify if it wants to stop untill all the cpus
> have processed the stop IPI. This is required specifically for the kexec case
> where we should wait for all the cpus to be stopped before starting the new
> kernel.
> We now wait for the cpus to stop in all cases except for panic/kdump where
> we expect things to be broken and we are doing our best to make things
> work anyway.

One little nit.


> Signed-off-by: Alok N Kataria <akataria@xxxxxxxxxx>
> Cc: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
> Cc: Jeremy Fitzhardinge <jeremy@xxxxxxxxxxxxx>
>
> Index: linux-2.6/arch/x86/include/asm/smp.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/smp.h 2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/include/asm/smp.h 2010-10-11 12:00:22.000000000 -0700
> @@ -50,7 +50,7 @@ struct smp_ops {
> void (*smp_prepare_cpus)(unsigned max_cpus);
> void (*smp_cpus_done)(unsigned max_cpus);
>
> - void (*smp_send_stop)(void);
> + void (*stop_other_cpus)(int wait);
> void (*smp_send_reschedule)(int cpu);
>
> int (*cpu_up)(unsigned cpu);
> @@ -73,7 +73,12 @@ extern struct smp_ops smp_ops;
>
> static inline void smp_send_stop(void)
> {
> - smp_ops.smp_send_stop();
> + smp_ops.stop_other_cpus(0);
> +}
> +
> +static inline void stop_other_cpus(void)
> +{
> + smp_ops.stop_other_cpus(1);
> }
>
> static inline void smp_prepare_boot_cpu(void)
> Index: linux-2.6/arch/x86/kernel/reboot.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/reboot.c 2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/reboot.c 2010-10-11 12:05:02.000000000 -0700
> @@ -641,7 +641,7 @@ void native_machine_shutdown(void)
> /* O.K Now that I'm on the appropriate processor,
> * stop all of the others.
> */
> - smp_send_stop();
> + stop_other_cpus();
> #endif
>
> lapic_shutdown();
> Index: linux-2.6/arch/x86/kernel/smp.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/smp.c 2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/kernel/smp.c 2010-10-11 12:16:42.000000000 -0700
> @@ -159,10 +159,10 @@ asmlinkage void smp_reboot_interrupt(voi
> irq_exit();
> }
>
> -static void native_smp_send_stop(void)
> +static void native_stop_other_cpus(int wait)
> {
> unsigned long flags;
> - unsigned long wait;
> + unsigned long timeout;
>
> if (reboot_force)
> return;
> @@ -179,9 +179,12 @@ static void native_smp_send_stop(void)
> if (num_online_cpus() > 1) {
> apic->send_IPI_allbutself(REBOOT_VECTOR);
>
> - /* Don't wait longer than a second */
> - wait = USEC_PER_SEC;
> - while (num_online_cpus() > 1 && wait--)
> + /*
> + * Don't wait longer than a second if the caller
> + * didn't ask us to wait.
> + */
> + timeout = USEC_PER_SEC;
> + while (num_online_cpus() > 1 && (wait || timeout--))
> udelay(1);
> }
>
> @@ -227,7 +230,7 @@ struct smp_ops smp_ops = {
> .smp_prepare_cpus = native_smp_prepare_cpus,
> .smp_cpus_done = native_smp_cpus_done,
>
> - .smp_send_stop = native_smp_send_stop,
> + .stop_other_cpus = native_stop_other_cpus,
> .smp_send_reschedule = native_smp_send_reschedule,
>
> .cpu_up = native_cpu_up,
> Index: linux-2.6/arch/x86/xen/enlighten.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/xen/enlighten.c 2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/xen/enlighten.c 2010-10-11 11:58:53.000000000 -0700
> @@ -1018,7 +1018,7 @@ static void xen_reboot(int reason)
> struct sched_shutdown r = { .reason = reason };
>
> #ifdef CONFIG_SMP
> - smp_send_stop();
> + stop_other_cpus();
> #endif
>
> if (HYPERVISOR_sched_op(SCHEDOP_shutdown, &r))
> Index: linux-2.6/arch/x86/xen/smp.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/xen/smp.c 2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/arch/x86/xen/smp.c 2010-10-11 11:58:53.000000000 -0700
> @@ -400,9 +400,9 @@ static void stop_self(void *v)
> BUG();
> }
>
> -static void xen_smp_send_stop(void)
> +static void xen_stop_other_cpus(int wait)
> {
> - smp_call_function(stop_self, NULL, 0);
> + smp_call_function(stop_self, NULL, wait);
> }
>
> static void xen_smp_send_reschedule(int cpu)
> @@ -470,7 +470,7 @@ static const struct smp_ops xen_smp_ops
> .cpu_disable = xen_cpu_disable,
> .play_dead = xen_play_dead,
>
> - .smp_send_stop = xen_smp_send_stop,
> + .stop_other_cpus = xen_stop_other_cpus,
> .smp_send_reschedule = xen_smp_send_reschedule,
>
> .send_call_func_ipi = xen_smp_send_call_function_ipi,
> Index: linux-2.6/include/linux/smp.h
> ===================================================================
> --- linux-2.6.orig/include/linux/smp.h 2010-10-11 11:56:40.000000000 -0700
> +++ linux-2.6/include/linux/smp.h 2010-10-11 12:07:19.000000000 -0700
> @@ -116,7 +116,8 @@ extern unsigned int setup_max_cpus;
>
> #else /* !SMP */
>
> -static inline void smp_send_stop(void) { }
> +static inline void smp_send_stop() { }
> +static inline void stop_other_cpus() { }

You shouldn't change smp.h.

For the moment stop_other_cpus() is x86 specific so we shouldn't
impose it in other architectures.

Also note removing the explicit void is a bad idea in C because that is
equivalent to saying: smp_send_stop(...) { } Which allows any set
of parameters you can think of.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/