Re: [PATCH RFC 3/3] arm64: Add HOTPLUG_PARALLEL support for secondary CPUs
From: Jinjie Ruan
Date: Mon Jun 22 2026 - 04:14:26 EST
On 6/18/2026 8:21 PM, Will Deacon wrote:
> Hi Jinjie,
>
> On Mon, Jun 15, 2026 at 04:51:48PM +0800, Jinjie Ruan wrote:
>> On 6/12/2026 11:45 PM, Michael Kelley wrote:
>>> From: Jinjie Ruan <ruanjinjie@xxxxxxxxxx> Sent: Thursday, June 11, 2026 6:38 AM
>>>>
>>>> Support for parallel secondary CPU bringup is already utilized by x86,
>>>> MIPS, and RISC-V. This patch brings this capability to the arm64
>>>> architecture.
>>>>
>>>> Rework the global `secondary_data` accessed during early boot into
>>>> a per-CPU array. This array maps logical CPU IDs to MPIDR_EL1 values,
>>>> enabling the early boot code in head.S to resolve each secondary CPU's
>>>> logical ID concurrently.
>>>>
>>>> To fully enable HOTPLUG_PARALLEL, this patch implements:
>>>> 1) An arm64-specific arch_cpuhp_kick_ap_alive() handler.
>>>> 2) Callbacks to cpuhp_ap_sync_alive() inside secondary_start_kernel().
>>>>
>>>> Successfully tested on QEMU ARM64 virt machine (KVM on, 128 vCPUs).
>>>>
>>>> | test kernel | secondary CPUs boot time |
>>>> | --------------------- | -------------------- |
>>>> | Without this patch | 155.672 |
>>>> | cpuhp.parallel=0 | 62.897 |
>>>> | cpuhp.parallel=1 | 166.703 |
>>>
>>> The last two rows seem mixed up. I would expect parallel=0 to
>>> result in a longer boot time.
>>
>> Hi, Michael,
>>
>> The results are correct and not mixed up.
>>
>> Compared to the original non‑HOTPLUG_PARALLEL approach, the advantage of
>> cpuhp.parallel=0 lies in its use of cpu_relax(`yield` on arm64) instead
>> of the wait_for_completion_timeout() mechanism (which may cause sleep
>> and context switching). This significantly reduces the overhead of VM
>> exits and context switches in a KVM guest, thereby cutting the secondary
>> CPU boot time by more than half.
>
> I don't think that's a particularly compelling reason to enable this for
> arm64, in all honesty. The yield instruction typically doesn't do
> anything on actual arm64 silicon, so this probably means that you're
> introducing busy-loops which tend to be bad for power and scalability.
After updating the implementation in v2, the performance gains are
primarily observed on actual hardware.
>
> I implemented this a while ago [1] but didn't manage to see much in terms
> of performance improvement and so I didn't bother to send the patches out
As the v2 results below show, on actual hardware parallel bringup (P=1)
reduces the secondary CPU bringup time by roughly 40%-60%.
Bringup Time Comparison (ms, lower is better):
| Platform                | Baseline | P=0     | P=1    | Delta(%) |
| ----------------------- | -------- | ------- | ------ | -------- |
| 64-core ATF QEMU        | 2075.8   | 2080.7  | 1653.4 | 20.34%   |
| 192-core server (HIP12) | 14619.2  | 14619.1 | 8589.4 | 41.21%   |
| 32-core board           | 2776.5   | 2881.0  | 1045.0 | 62.36%   |
Link:
https://lore.kernel.org/all/20260618092444.1316336-5-ruanjinjie@xxxxxxxxxx/
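For reference, a minimal userspace sketch of how the Delta(%) column can
be read, assuming it is the reduction of P=1 relative to Baseline. The
helper and struct names here are illustrative only and are not part of
the series. The inputs are the values from the table above, so the
results may differ slightly in the last digits from the Delta(%) column.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the patch series): percentage by which
 * parallel_ms is lower than baseline_ms.
 */
double reduction_pct(double baseline_ms, double parallel_ms)
{
	assert(baseline_ms > 0.0);
	return (baseline_ms - parallel_ms) / baseline_ms * 100.0;
}

/* One row of the bringup table: Baseline and P=1 columns, in ms. */
struct bringup_row {
	const char *platform;
	double baseline_ms;
	double parallel_ms;
};

static const struct bringup_row rows[] = {
	{ "64-core ATF QEMU",        2075.8,  1653.4 },
	{ "192-core server (HIP12)", 14619.2, 8589.4 },
	{ "32-core board",           2776.5,  1045.0 },
};

/* Print the assumed Baseline -> P=1 reduction for every row. */
void print_reductions(void)
{
	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%-24s %6.2f%%\n", rows[i].platform,
		       reduction_pct(rows[i].baseline_ms, rows[i].parallel_ms));
}
```

Calling print_reductions() from a small main() prints one reduction per
platform.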
> after talking about it at KVM forum [2]. However, as mentioned at the end
> of that talk, it _is_ still useful for confidential VMs using PSCI so
> let me dust off my old series and send it out to see what you think.
>
> It relies on PSCI v0.2, which means we don't need the NR_CPUS size array
> for secondary_data and I also have some support for error handling (it
> doesn't look like you handle __early_cpu_boot_status properly).
I need some time to look closely at your patches. Alternatively, I could
integrate your changes, re-test everything on actual hardware, and then
send out a revised version.
>
> It looks like I could include your first patch, though!
Thank you very much.
>
> Will
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=cpu-hotplug
It seems that the following patch, which removes
`rcutree_report_cpu_starting()`, will reintroduce the issue that
commit ce3d31ad3cac ("arm64/smp: Move rcu_cpu_starting() earlier")
solved.
Link:
https://web.git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/commit/?h=cpu-hotplug&id=bba4b62f45f2614bf6085e6cd3f233528f85bf26
Indeed, I also noticed that the invocation order of
rcutree_report_cpu_starting() on arm64 is somewhat suboptimal. It
hinders the implementation of parallel bringup on arm64 and could
potentially lead to RCU stalls.
Link:
https://lore.kernel.org/all/20260618092444.1316336-4-ruanjinjie@xxxxxxxxxx/
[ 0.329017] smp: Bringing up secondary CPUs ...
[ 0.343628] Detected VIPT I-cache on CPU1
[ 0.343788]
[ 0.343806] =============================
[ 0.343816] WARNING: suspicious RCU usage
[ 0.343966] 7.1.0-rc1-g27c1871848a2 #109 Not tainted
[ 0.344087] -----------------------------
[ 0.344098] kernel/locking/lockdep.c:3801 RCU-list traversed in non-reader section!!
[ 0.344112]
[ 0.344112] other info that might help us debug this:
[ 0.344112]
[ 0.344135]
[ 0.344135] RCU used illegally from offline CPU!
[ 0.344135] rcu_scheduler_active = 1, debug_locks = 1
[ 0.344174] no locks held by swapper/1/0.
[ 0.344204]
[ 0.344204] stack backtrace:
[ 0.344611] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 7.1.0-rc1-g27c1871848a2 #109 PREEMPT
[ 0.344707] Hardware name: linux,dummy-virt (DT)
[ 0.345267] Call trace:
[ 0.345436] show_stack+0x18/0x24 (C)
[ 0.345593] dump_stack_lvl+0x90/0xd0
[ 0.345620] dump_stack+0x18/0x24
[ 0.345639] lockdep_rcu_suspicious+0x170/0x234
[ 0.345665] __lock_acquire+0xdd4/0x2078
[ 0.345688] lock_acquire+0x1c4/0x3f0
[ 0.345711] _raw_spin_lock_irqsave+0x60/0x88
[ 0.345736] down_trylock+0x18/0x48
[ 0.345758] __down_trylock_console_sem+0x38/0xc4
[ 0.345782] vprintk_emit+0x23c/0x3d0
[ 0.345802] vprintk_default+0x38/0x44
[ 0.345822] vprintk+0x28/0x34
[ 0.345841] _printk+0x5c/0x84
[ 0.345864] cpuinfo_store_cpu+0x174/0x298
[ 0.345884] secondary_start_kernel+0xbc/0x150
[ 0.345905] __secondary_switched+0xc0/0xc4
[ 0.350307] GICv3: CPU1: found redistributor 1 region 0:0x00000000080c0000
[ 0.350523] GICv3: CPU1: using allocated LPI pending table @0x00000001042f0000
[ 0.351303] CPU1: Booted secondary processor 0x0000000001 [0x410fd034]
[ 0.387425] Detected VIPT I-cache on CPU2
> [2] https://www.youtube.com/watch?v=Q6kOshnnQuE
>