Re: [PATCH] use x86 cpu park to speedup smp_init in kexec situation

From: David Woodhouse
Date: Tue Feb 16 2021 - 08:54:35 EST


On Fri, 2021-02-12 at 18:30 +0100, Thomas Gleixner wrote:
> On Fri, Feb 12 2021 at 01:29, Thomas Gleixner wrote:
> > On Thu, Feb 11 2021 at 22:58, David Woodhouse wrote:
> > I have no problem with making that jump based. It does not matter at
> > all. But you can't use the idle task stack before the CR3 switch in
> > secondary_startup_64 - unless I'm missing something.
> >
> > And there's more than 'call verify_cpu' on the way which uses stack. Let
> > me stare more tomorrow with brain awake.
>
> The below boots on top of mainline. It probably breaks i386 and XEN and
> whatever :)

Nice. As discussed on IRC, you'll need more than 0xFFFF for APIC IDs,
but I think you can probably get away with 0xFFFFF for the physical
APIC ID, since nothing larger than that can be encoded in the 32 bits
of a *logical* X2APIC ID.
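
For reference, that 20-bit limit falls out of how the logical x2APIC
ID is derived from the physical one. A minimal sketch in C (the helper
name is mine, the derivation is the SDM's):

	/*
	 * Logical x2APIC ID: bits 31:16 carry the cluster (physical
	 * ID >> 4), bits 15:0 a one-hot position within the 16-CPU
	 * cluster. A 32-bit logical ID therefore only covers physical
	 * APIC IDs up to 20 bits, i.e. 0xFFFFF.
	 */
	static inline u32 x2apic_logical_id(u32 phys_apicid)
	{
		return ((phys_apicid >> 4) << 16) | (1U << (phys_apicid & 0xf));
	}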

> I didn't come around to play with your patches yet and won't be able to
> do so before next week.

I threw it into my tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/parallel

It seems to work fairly nicely. The parallel SIPI seems to win me about
a third of the bringup time on my 28-thread Haswell box. This is at the
penultimate commit of the above branch:

[ 0.307590] smp: Bringing up secondary CPUs ...
[ 0.307826] x86: Booting SMP configuration:
[ 0.307830] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14
[ 0.376677] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.377177] #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
[ 0.402323] Brought CPUs online in 246691584 cycles
[ 0.402323] smp: Brought up 1 node, 28 CPUs

... and this is the tip of the branch:

[ 0.308332] smp: Bringing up secondary CPUs ...
[ 0.308569] x86: Booting SMP configuration:
[ 0.308572] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27
[ 0.321120] Brought 28 CPUs to x86/cpu:kick in 34828752 cycles
[ 0.366663] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.368749] Brought CPUs online in 124913032 cycles
[ 0.368749] smp: Brought up 1 node, 28 CPUs
[ 0.368749] smpboot: Max logical packages: 1
[ 0.368749] smpboot: Total of 28 processors activated (145259.85 BogoMIPS)

There's more to be gained here if we can fix up the next stage. Right
now if I set every CPU's bit in cpu_initialized_mask to allow them to
proceed from wait_for_master_cpu() through to the end of cpu_init() and
onwards through start_secondary(), they all end up hitting
check_tsc_sync_target() in parallel and it goes horridly wrong.

A simple answer might be to have another synchronisation point right
before the TSC sync, to force them to do it one at a time but *without*
having to wait for the rest of the AP bringup in series.
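
Something like this in start_secondary(), just before the sync, would
be the shape of it (a rough sketch only: the gate variable is invented
here, and the BSP side would still need to pair each
check_tsc_sync_target() with a check_tsc_sync_source() in the same
order):

	/*
	 * Hypothetical one-at-a-time gate in front of the TSC sync.
	 * Each AP takes the gate, syncs against the BSP, releases it.
	 * Only the sync itself is serialised; the rest of the AP
	 * bringup still runs in parallel.
	 */
	static atomic_t tsc_sync_gate = ATOMIC_INIT(0);

	while (atomic_cmpxchg(&tsc_sync_gate, 0, 1) != 0)
		cpu_relax();
	check_tsc_sync_target();
	atomic_set(&tsc_sync_gate, 0);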

Other answers include inventing a magical parallel TSC sync algorithm
that isn't 1-to-1 between the BSP and each AP. Or observing that we're
running in KVM or after kexec, and we *know* they're in sync anyway and
bypassing the whole bloody nonsense.
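
For the latter: as I recall, check_tsc_sync_target() already has early
exits for the trusted-TSC cases, so the bypass is really about getting
the kexec/KVM case into one of them:

	/* Existing early exits in check_tsc_sync_target(); if kexec
	 * or KVM could mark the TSC reliable, the whole dance goes
	 * away. */
	if (unsynchronized_tsc())
		return;
	if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable)
		return;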

Recall that my initial debug patch timing the four stages of the
bringup, which I've left in my branch above for visibility, was showing
this kind of output with the original serial bringup...

> [ 0.285312] CPU#10 up in 192950, 952898, 60014786, 28 ( 61160662)
> [ 0.288311] CPU#11 up in 181092, 962704, 60010432, 30 ( 61154258)
> [ 0.291312] CPU#12 up in 386080, 970416, 60013956, 28 ( 61370480)
> [ 0.294311] CPU#13 up in 372782, 964506, 60010564, 28 ( 61347880)
> [ 0.297312] CPU#14 up in 389602, 976280, 60013046, 28 ( 61378956)
> [ 0.300312] CPU#15 up in 213132, 968148, 60012138, 28 ( 61193446)

So that's almost a million cycles per CPU of do_wait_cpu_callin()
before we get to the TSC sync (which is insanely expensive on KVM).
It's something like three times the time taken by the SIPI+wait in the
first phase, which is what we've parallelised so far (at a gain of
about 33%).
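
For reference, those four numbers per CPU are just cycle-counter deltas
around each phase. The real debug patch is in the branch above, but it
boils down to something of this shape (names and structure here are
illustrative, not the actual patch):

	/*
	 * Sketch of the instrumentation behind the "CPU#N up in
	 * a, b, c, d (total)" lines: the BSP snapshots the cycle
	 * counter at the five phase boundaries (before SIPI, after
	 * SIPI+wait, after callin, after TSC sync, after online) and
	 * prints the deltas plus the total.
	 */
	static u64 stamp[5];
	static int nstamp;

	static void bringup_stamp(void)
	{
		stamp[nstamp++] = get_cycles();
	}

	static void bringup_report(int cpu)
	{
		pr_info("CPU#%d up in %llu, %llu, %llu, %llu (%llu)\n", cpu,
			stamp[1] - stamp[0], stamp[2] - stamp[1],
			stamp[3] - stamp[2], stamp[4] - stamp[3],
			stamp[4] - stamp[0]);
		nstamp = 0;
	}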



> ---
> arch/x86/include/asm/realmode.h | 3 +
> arch/x86/include/asm/smp.h | 9 +++-
> arch/x86/kernel/acpi/sleep.c | 1
> arch/x86/kernel/apic/apic.c | 2
> arch/x86/kernel/head_64.S | 71 +++++++++++++++++++++++++++++++++++
> arch/x86/kernel/smpboot.c | 19 ++++++++-
> arch/x86/realmode/init.c | 3 +
> arch/x86/realmode/rm/trampoline_64.S | 14 ++++++
> kernel/smpboot.c | 2
> 9 files changed, 119 insertions(+), 5 deletions(-)
>
> --- a/arch/x86/include/asm/realmode.h
> +++ b/arch/x86/include/asm/realmode.h
> @@ -51,6 +51,7 @@ struct trampoline_header {
> u64 efer;
> u32 cr4;
> u32 flags;
> + u32 lock;
> #endif
> };
>
> @@ -64,6 +65,8 @@ extern unsigned long initial_stack;
> extern unsigned long initial_vc_handler;
> #endif
>
> +extern u32 *trampoline_lock;
> +
> extern unsigned char real_mode_blob[];
> extern unsigned char real_mode_relocs[];
>
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -187,5 +187,12 @@ extern void nmi_selftest(void);
> #define nmi_selftest() do { } while (0)
> #endif
>
> -#endif /* __ASSEMBLY__ */
> +extern unsigned int smpboot_control;
> +
> +#endif /* !__ASSEMBLY__ */
> +
> +/* Control bits for startup_64 */
> +#define STARTUP_USE_APICID 0x10000
> +#define STARTUP_USE_CPUID_0B 0x20000
> +
> #endif /* _ASM_X86_SMP_H */
> --- a/arch/x86/kernel/acpi/sleep.c
> +++ b/arch/x86/kernel/acpi/sleep.c
> @@ -114,6 +114,7 @@ int x86_acpi_suspend_lowlevel(void)
> early_gdt_descr.address =
> (unsigned long)get_cpu_gdt_rw(smp_processor_id());
> initial_gs = per_cpu_offset(smp_processor_id());
> + smpboot_control = 0;
> #endif
> initial_code = (unsigned long)wakeup_long64;
> saved_magic = 0x123456789abcdef0L;
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -2333,7 +2333,7 @@ static int nr_logical_cpuids = 1;
> /*
> * Used to store mapping between logical CPU IDs and APIC IDs.
> */
> -static int cpuid_to_apicid[] = {
> +int cpuid_to_apicid[] = {
> [0 ... NR_CPUS - 1] = -1,
> };
>
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -25,6 +25,7 @@
> #include <asm/export.h>
> #include <asm/nospec-branch.h>
> #include <asm/fixmap.h>
> +#include <asm/smp.h>
>
> /*
> * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
> @@ -177,6 +178,64 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> UNWIND_HINT_EMPTY
>
> /*
> + * Is this the boot CPU coming up? If so everything is available
> + * in initial_gs, initial_stack and early_gdt_descr.
> + */
> + movl smpboot_control(%rip), %eax
> + testl %eax, %eax
> + jz .Lsetup_cpu
> +
> + /*
> + * Secondary CPUs find out the offsets via the APIC ID. For parallel
> + * boot the APIC ID is retrieved from CPUID, otherwise it's encoded
> + * in smpboot_control:
> + * Bit 0-15 APICID if STARTUP_USE_CPUID_0B is not set
> + * Bit 16 Secondary boot flag
> + * Bit 17 Parallel boot flag
> + */
> + testl $STARTUP_USE_CPUID_0B, %eax
> + jz .Lsetup_AP
> +
> + mov $0x0B, %eax
> + xorl %ecx, %ecx
> + cpuid
> + mov %edx, %eax
> +
> +.Lsetup_AP:
> + /* EAX contains the APICID of the current CPU */
> + andl $0xFFFF, %eax
> + xorl %ecx, %ecx
> + leaq cpuid_to_apicid(%rip), %rbx
> +
> +.Lfind_cpunr:
> + cmpl (%rbx), %eax
> + jz .Linit_cpu_data
> + addq $4, %rbx
> + addq $8, %rcx
> + jmp .Lfind_cpunr
> +
> +.Linit_cpu_data:
> + /* Get the per cpu offset */
> + leaq __per_cpu_offset(%rip), %rbx
> + addq %rcx, %rbx
> + movq (%rbx), %rbx
> + /* Save it for GS BASE setup */
> + movq %rbx, initial_gs(%rip)
> +
> + /* Calculate the GDT address */
> + movq $gdt_page, %rcx
> + addq %rbx, %rcx
> + movq %rcx, early_gdt_descr_base(%rip)
> +
> + /* Find the idle task stack */
> + movq $idle_threads, %rcx
> + addq %rbx, %rcx
> + movq (%rcx), %rcx
> + movq TASK_threadsp(%rcx), %rcx
> + movq %rcx, initial_stack(%rip)
> +
> +.Lsetup_cpu:
> + /*
> * We must switch to a new descriptor in kernel space for the GDT
> * because soon the kernel won't have access anymore to the userspace
> * addresses where we're currently running on. We have to do that here
> @@ -216,6 +275,14 @@ SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
> */
> movq initial_stack(%rip), %rsp
>
> + /* Drop the realmode protection. For the boot CPU the pointer is NULL! */
> + movq trampoline_lock(%rip), %rax
> + testq %rax, %rax
> + jz .Lsetup_idt
> + lock
> + btrl $0, (%rax)
> +
> +.Lsetup_idt:
> /* Setup and Load IDT */
> pushq %rsi
> call early_setup_idt
> @@ -347,6 +414,7 @@ SYM_DATA(initial_vc_handler, .quad handle_vc_boot_ghcb)
> * reliably detect the end of the stack.
> */
> SYM_DATA(initial_stack, .quad init_thread_union + THREAD_SIZE - SIZEOF_PTREGS)
> +SYM_DATA(trampoline_lock, .quad 0);
> __FINITDATA
>
> __INIT
> @@ -573,6 +641,9 @@ SYM_DATA(early_gdt_descr, .word GDT_ENTRIES*8-1)
> SYM_DATA_LOCAL(early_gdt_descr_base, .quad INIT_PER_CPU_VAR(gdt_page))
>
> .align 16
> +SYM_DATA(smpboot_control, .long 0)
> +
> + .align 16
> /* This must match the first entry in level2_kernel_pgt */
> SYM_DATA(phys_base, .quad 0x0)
> EXPORT_SYMBOL(phys_base)
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1021,6 +1021,16 @@ int common_cpu_up(unsigned int cpu, stru
> return 0;
> }
>
> +static void setup_smpboot_control(unsigned int apicid)
> +{
> +#ifdef CONFIG_X86_64
> + if (boot_cpu_data.cpuid_level < 0x0B)
> + smpboot_control = apicid | STARTUP_USE_APICID;
> + else
> + smpboot_control = STARTUP_USE_CPUID_0B;
> +#endif
> +}
> +
> /*
> * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
> * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
> @@ -1037,9 +1047,14 @@ static int do_boot_cpu(int apicid, int c
> unsigned long timeout;
>
> idle->thread.sp = (unsigned long)task_pt_regs(idle);
> - early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
> initial_code = (unsigned long)start_secondary;
> - initial_stack = idle->thread.sp;
> +
> + if (IS_ENABLED(CONFIG_X86_32)) {
> + early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
> + initial_stack = idle->thread.sp;
> + }
> +
> + setup_smpboot_control(apicid);
>
> /* Enable the espfix hack for this CPU */
> init_espfix_ap(cpu);
> --- a/arch/x86/realmode/init.c
> +++ b/arch/x86/realmode/init.c
> @@ -125,6 +125,9 @@ static void __init setup_real_mode(void)
>
> trampoline_header->flags = 0;
>
> + trampoline_lock = &trampoline_header->lock;
> + *trampoline_lock = 0;
> +
> trampoline_pgd = (u64 *) __va(real_mode_header->trampoline_pgd);
> trampoline_pgd[0] = trampoline_pgd_entry.pgd;
> trampoline_pgd[511] = init_top_pgt[511].pgd;
> --- a/arch/x86/realmode/rm/trampoline_64.S
> +++ b/arch/x86/realmode/rm/trampoline_64.S
> @@ -49,6 +49,19 @@ SYM_CODE_START(trampoline_start)
> mov %ax, %es
> mov %ax, %ss
>
> + /*
> + * Make sure only one CPU fiddles with the realmode stack
> + */
> +.Llock_rm:
> + btw $0, tr_lock
> + jnc 2f
> + pause
> + jmp .Llock_rm
> +2:
> + lock
> + btsw $0, tr_lock
> + jc .Llock_rm
> +
> # Setup stack
> movl $rm_stack_end, %esp
>
> @@ -192,6 +205,7 @@ SYM_DATA_START(trampoline_header)
> SYM_DATA(tr_efer, .space 8)
> SYM_DATA(tr_cr4, .space 4)
> SYM_DATA(tr_flags, .space 4)
> + SYM_DATA(tr_lock, .space 4)
> SYM_DATA_END(trampoline_header)
>
> #include "trampoline_common.S"
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -25,7 +25,7 @@
> * For the hotplug case we keep the task structs around and reuse
> * them.
> */
> -static DEFINE_PER_CPU(struct task_struct *, idle_threads);
> +DEFINE_PER_CPU(struct task_struct *, idle_threads);
>
> struct task_struct *idle_thread_get(unsigned int cpu)
> {
