Re: [PATCH] arm64: kexec: add support for kexec with spin-table

From: Henry Willard
Date: Wed Jul 14 2021 - 20:08:53 EST


Hi, Mark,
Thanks for reviewing this. I am not in a position to go into too much detail about the particular device, but the u-boot we are using is the u-boot we have to use, at least for now. We would have preferred to have PSCI, but that option is not available. Modifying u-boot is not an option.

It is possible to do this without relying on the spin-table loop. I implemented such a version using the kexec control code page before I got my hands on the device actually using spin-table. That implementation needed changes in a lot of places, because the secondary CPUs had to leave the control code page before the boot CPU entered the new kernel. Reusing the spin-table loop simplified things quite a bit.

This has been useful to us, so we thought we would pass it along to see if it is useful to anyone else in the same situation.

> On Jul 14, 2021, at 11:47 AM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> Hi Henry,
>
> On Wed, Jul 14, 2021 at 10:41:13AM -0700, Henry Willard wrote:
>> With one special exception, kexec is not supported on systems
>> that use spin-table as the cpu enablement method instead of PSCI.
>> The spin-table implementation lacks cpu_die() and several other
>> methods needed by the hotplug framework used by kexec on Arm64.
>>
>> Some embedded systems may not have a need for the Arm Trusted
>> Firmware, or they may lack it during early bring-up. Some of
>> these may have a more primitive version of u-boot that uses a
>> special device from which to load the kernel. Kexec can be
>> especially useful for testing new kernels in such an environment.
>>
>> What is needed to support kexec is some place for cpu_die to park
>> the secondary CPUs outside the kernel while the primary copies
>> the new kernel into place and starts it. One possibility is to
>> use the control-code-page where arm64_relocate_new_kernel()
>> executes, but that requires a complicated and racy dance to get
>> the secondary CPUs from the control-code-page to the new
>> kernel after it has been copied.
>>
>> The spin-table mechanism is set up before the Linux kernel
>> is entered with details provided in the device tree. The
>> "cpu-release-addr" DT property provides the address of a word the
>> secondary CPUs are polling. The boot CPU will store the real address
>> of secondary_holding_pen() at that address, and the secondary CPUs
>> will branch to that address. secondary_holding_pen() is another
>> loop where the secondary CPUs wait to be called up by the boot CPU.
>>
>> This patch uses that mechanism to implement cpu_die(). In modern
>> versions of u-boot that implement spin-table, the address of the
>> loop in protected memory can be derived from the "cpu-release-addr"
>> value. The patch validates the existence of the loop before
>> proceeding. smp_spin_table_cpu_die() uses cpu_soft_restart() to
>> branch to the loop with the MMU and caching turned off where the
>> CPU waits until released by the new kernel. After that kexec
>> reboot proceeds normally.
>
> This isn't true for all spin-table implementations; for example this is
> not safe with the boot-wrapper.
>
> While I'm not necessarily opposed to providing a mechanism to return a
> CPU back to the spin-table, the presence of that mechanism needs to be
> explicitly defined in the device tree (e.g. with a "cpu-return-addr"
> property or similar), and we need to thoroughly document the contract
> (e.g. what state the CPU is in when it is returned). We've generally
> steered clear of this since it is much more complicated than it may
> initially seem, and there is immense scope for error.
>
> If we do choose to extend spin-table in this way, we'll also need to
> enforce that each cpu has a unique cpu-release-address, or this is
> unsound to begin with (since e.g. the kernel can't return CPUs that it
> doesn't know are stuck in the holding pen). We will also need a
> mechanism to reliably identify when the CPU has been successfully
> returned.
>
> I would very much like to avoid this if possible. U-Boot does have a
> PSCI implementation that some platforms use; is it not possible to use
> this?

Unfortunately, no. If we had that we would never have bothered with this.

>
> If this is for early bringup, and you're using the first kernel as a
> bootloader, I'd suggest that you boot that with "nosmp", such that the
> first kernel doesn't touch the secondary CPUs at all.

The particular case that spawned this is past that. There are a number of reasons why we need to be able to kexec a new kernel. Being able to bypass the kernel installation process, which is a little more complicated than normal, to test new kernels is an added benefit.

>
>> The special exception is the kdump capture kernel, which gets
>> started even if the secondaries can't be stopped.
>>
>> Signed-off-by: Henry Willard <henry.willard@xxxxxxxxxx>
>> ---
>> arch/arm64/kernel/smp_spin_table.c | 111 +++++++++++++++++++++++++++++++++++++
>> 1 file changed, 111 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/smp_spin_table.c b/arch/arm64/kernel/smp_spin_table.c
>> index 7e1624ecab3c..35c7fa764476 100644
>> --- a/arch/arm64/kernel/smp_spin_table.c
>> +++ b/arch/arm64/kernel/smp_spin_table.c
>> @@ -13,16 +13,27 @@
>> #include <linux/mm.h>
>>
>> #include <asm/cacheflush.h>
>> +#include <asm/daifflags.h>
>> #include <asm/cpu_ops.h>
>> #include <asm/cputype.h>
>> #include <asm/io.h>
>> #include <asm/smp_plat.h>
>> +#include <asm/mmu_context.h>
>> +#include <asm/kexec.h>
>> +
>> +#include "cpu-reset.h"
>>
>> extern void secondary_holding_pen(void);
>> volatile unsigned long __section(".mmuoff.data.read")
>> secondary_holding_pen_release = INVALID_HWID;
>>
>> static phys_addr_t cpu_release_addr[NR_CPUS];
>> +static unsigned int spin_table_loop[4] = {
>> + 0xd503205f, /* wfe */
>> + 0x58000060, /* ldr x0, spin_table_cpu_release_addr */
>> + 0xb4ffffc0, /* cbz x0, 0b */
>> + 0xd61f0000 /* br x0 */
>> +};
>>
>> /*
>> * Write secondary_holding_pen_release in a way that is guaranteed to be
>> @@ -119,9 +130,109 @@ static int smp_spin_table_cpu_boot(unsigned int cpu)
>> return 0;
>> }
>>
>> +
>> +/*
>> + * There is a four instruction loop set aside in protected
>> + * memory by u-boot where secondary CPUs wait for the kernel to
>> + * start.
>> + *
>> + * 0: wfe
>> + * ldr x0, spin_table_cpu_release_addr
>> + * cbz x0, 0b
>> + * br x0
>> + * spin_table_cpu_release_addr:
>> + * .quad 0
>> + *
>> + * The address of spin_table_cpu_release_addr is passed in the
>> + * "cpu-release-addr" property in the device tree.
>> + * smp_spin_table_cpu_prepare() stores the real address of
>> + * secondary_holding_pen() where the secondary CPUs loop
>> + * until they are released one at a time by smp_spin_table_cpu_boot().
>> + * We reuse the spin-table loop by clearing spin_table_cpu_release_addr,
>> + * and branching to the beginning of the loop via cpu_soft_restart(),
>> + * which turns off the MMU and caching.
>> + */
>> +static void smp_spin_table_cpu_die(unsigned int cpu)
>> +{
>> + __le64 __iomem *release_addr;
>> + unsigned int *spin_table_inst;
>> + unsigned long spin_table_start;
>> +
>> + if (!cpu_release_addr[cpu])
>> + goto spin;
>> +
>> + spin_table_start = (cpu_release_addr[cpu] - sizeof(spin_table_loop));
>> +
>> + /*
>> + * The cpu-release-addr may or may not be inside the linear mapping.
>> + * As ioremap_cache will either give us a new mapping or reuse the
>> + * existing linear mapping, we can use it to cover both cases. In
>> + * either case the memory will be MT_NORMAL.
>> + */
>> + release_addr = ioremap_cache(spin_table_start,
>> + sizeof(*release_addr) +
>> + sizeof(spin_table_loop));
>> +
>> + if (!release_addr)
>> + goto spin;
>> +
>> + spin_table_inst = (unsigned int *)release_addr;
>> + if (spin_table_inst[0] != spin_table_loop[0] ||
>> + spin_table_inst[1] != spin_table_loop[1] ||
>> + spin_table_inst[2] != spin_table_loop[2] ||
>> + spin_table_inst[3] != spin_table_loop[3])
>> + goto spin;
>
> Please don't hard-code a specific sequence for this; if we *really* need
> this, we should be given a cpu-return-addr explicitly, and we should
> simply trust it.

That would require changes to u-boot. The purpose of the check is to detect whether we have been given a new version of u-boot with a different loop. That seems a remote possibility, since this particular loop has been unchanged for quite some time, and it works well.
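
For anyone wanting to sanity-check the four words the patch compares against, here is a minimal standalone sketch (userspace C, not kernel code) that decodes the two PC-relative instructions. The helper names imm19_bytes(), is_ldr_literal_x(), is_cbz_x() and check_expected_loop() are made up for illustration and are not part of the patch. It assumes the standard A64 encodings, where imm19 sits in bits [23:5] and is scaled by 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The four instruction words the patch expects just below cpu-release-addr. */
const uint32_t expected_loop[4] = {
	0xd503205f,	/* 0x0: wfe */
	0x58000060,	/* 0x4: ldr x0, <literal> */
	0xb4ffffc0,	/* 0x8: cbz x0, 0b */
	0xd61f0000,	/* 0xc: br  x0 */
};

/* Sign-extend the 19-bit immediate in bits [23:5] and scale it to bytes. */
int32_t imm19_bytes(uint32_t insn)
{
	int32_t imm = (int32_t)((insn >> 5) & 0x7ffff);

	if (imm & 0x40000)
		imm -= 0x80000;
	return imm * 4;
}

/* LDR (literal) with a 64-bit destination has 0x58 in the top byte. */
bool is_ldr_literal_x(uint32_t insn)
{
	return (insn & 0xff000000u) == 0x58000000u;
}

/* CBZ with a 64-bit register has 0xb4 in the top byte (CBNZ would be 0xb5). */
bool is_cbz_x(uint32_t insn)
{
	return (insn & 0xff000000u) == 0xb4000000u;
}

/* Hypothetical self-check of the layout described in this thread. */
void check_expected_loop(void)
{
	/* The ldr at offset 0x4 reads the literal at offset 0x10. */
	assert(is_ldr_literal_x(expected_loop[1]));
	assert(0x4 + imm19_bytes(expected_loop[1]) == 0x10);

	/* The cbz at offset 0x8 branches back to the wfe at offset 0x0. */
	assert(is_cbz_x(expected_loop[2]));
	assert(0x8 + imm19_bytes(expected_loop[2]) == 0x0);
}
```

Calling check_expected_loop() from a small main() should pass silently. It only shows that the words describe a loop whose literal lives 0x10 bytes from the start; it says nothing about whether a given firmware actually placed that loop there.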

>
>> +
>> + /*
>> + * Clear the release address, so that we can use it again
>> + */
>> + writeq_relaxed(0, release_addr + 2);
>> + dcache_clean_inval_poc((__force unsigned long)(release_addr + 2),
>> + (__force unsigned long)(release_addr + 2) +
>> + sizeof(*release_addr));
>
> What is the `+ 2` for?

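Yeah, I could have been clearer. The mapping starts at the beginning of the spin-table loop, and the spin_table_cpu_release_addr variable sits at offset 0x10 from there, immediately after the four instructions. Since release_addr is a pointer to a 64-bit value, release_addr + 2 advances by 16 bytes to reach it.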
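
To make the arithmetic concrete, here is a small standalone sketch (userspace C, not kernel code). The struct spin_table_region and the helpers u64_index_to_offset() and check_release_offset() are hypothetical names used only to model the layout described above; the patch itself uses a bare __le64 pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the mapped region, starting at the loop. */
struct spin_table_region {
	uint32_t loop[4];	/* wfe; ldr; cbz; br: 4 * 4 = 16 bytes */
	uint64_t release_addr;	/* spin_table_cpu_release_addr */
};

/* Byte offset reached by adding n to a 64-bit pointer at the region base. */
size_t u64_index_to_offset(size_t n)
{
	return n * sizeof(uint64_t);
}

/* Check that "+ 2" on a 64-bit pointer lands on the release address. */
void check_release_offset(void)
{
	assert(offsetof(struct spin_table_region, release_addr) == 0x10);
	assert(u64_index_to_offset(2) == 0x10);
}
```

The size of this struct is also the length passed to ioremap_cache() in the patch: sizeof(*release_addr) + sizeof(spin_table_loop), that is, 8 + 16 bytes.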

>
>> +
>> + iounmap(release_addr);
>> +
>> + local_daif_mask();
>> + cpu_soft_restart(spin_table_start, 0, 0, 0);
>> +
>> + BUG(); /* Should never get here */
>> +
>> +spin:
>> + cpu_park_loop();
>> +
>> +}
>> +
>> +static int smp_spin_table_cpu_kill(unsigned int cpu)
>> +{
>> + unsigned long start, end;
>> +
>> + start = jiffies;
>> + end = start + msecs_to_jiffies(100);
>> +
>> + do {
>> + if (!cpu_online(cpu)) {
>> + pr_info("CPU%d killed\n", cpu);
>> + return 0;
>> + }
>> + } while (time_before(jiffies, end));
>> + pr_warn("CPU%d may not have shut down cleanly\n", cpu);
>> + return -ETIMEDOUT;
>> +
>> +}
>
> If we're going to extend this, we must add a mechanism to reliably
> identify when the CPU has been returned successfully. We can't rely on
> cpu_online(), because there's a window between the CPU marking itself as
> offline and actually exiting the kernel.
>
>> +
>> +/* Nothing to do here */
>> +static int smp_spin_table_cpu_disable(unsigned int cpu)
>> +{
>> + return 0;
>> +}
>
> For implementations where we cannot return the CPU, cpu_disable() *must*
> fail.
>
> Thanks,
> Mark.

Thanks for taking the time to review this.

Henry