Re: [PATCH RFC 3/3] arm64: Add HOTPLUG_PARALLEL support for secondary CPUs

From: Jinjie Ruan

Date: Wed Jun 24 2026 - 06:01:18 EST




On 6/23/2026 10:30 PM, Will Deacon wrote:
> On Mon, Jun 22, 2026 at 04:06:38PM +0800, Jinjie Ruan wrote:
>> On 6/18/2026 8:21 PM, Will Deacon wrote:
>>> On Mon, Jun 15, 2026 at 04:51:48PM +0800, Jinjie Ruan wrote:
>>>> On 6/12/2026 11:45 PM, Michael Kelley wrote:
>>>>> From: Jinjie Ruan <ruanjinjie@xxxxxxxxxx> Sent: Thursday, June 11, 2026 6:38 AM
>>>>>>
>>>>>> Support for parallel secondary CPU bringup is already utilized by x86,
>>>>>> MIPS, and RISC-V. This patch brings this capability to the arm64
>>>>>> architecture.
>>>>>>
>>>>>> Rework the global `secondary_data` accessed during early boot into
>>>>>> a per-CPU array. This array maps logical CPU IDs to MPIDR_EL1 values,
>>>>>> enabling the early boot code in head.S to resolve each secondary CPU's
>>>>>> logical ID concurrently.
>>>>>>
>>>>>> To fully enable HOTPLUG_PARALLEL, this patch implements:
>>>>>> 1) An arm64-specific arch_cpuhp_kick_ap_alive() handler.
>>>>>> 2) Callbacks to cpuhp_ap_sync_alive() inside secondary_start_kernel().
>>>>>>
>>>>>> Successfully tested on QEMU ARM64 virt machine (KVM on, 128 vCPUs).
>>>>>>
>>>>>> | test kernel | secondary CPUs boot time |
>>>>>> | --------------------- | -------------------- |
>>>>>> | Without this patch | 155.672 |
>>>>>> | cpuhp.parallel=0 | 62.897 |
>>>>>> | cpuhp.parallel=1 | 166.703 |
>>>>>
>>>>> The last two rows seem mixed up. I would expect parallel=0 to
>>>>> result in a longer boot time.
>>>>
>>>> The results are correct and not mixed up.
>>>>
>>>> Compared to the original non‑HOTPLUG_PARALLEL approach, the advantage of
>>>> cpuhp.parallel=0 lies in its use of cpu_relax() (`yield` on arm64) instead
>>>> of the wait_for_completion_timeout() mechanism (which may cause sleep
>>>> and context switching). This significantly reduces the overhead of VM
>>>> exits and context switches in a KVM guest, thereby cutting the secondary
>>>> CPU boot time by more than half.
>>>
>>> I don't think that's a particularly compelling reason to enable this for
>>> arm64, in all honesty. The yield instruction typically doesn't do
>>> anything on actual arm64 silicon, so this probably means that you're
>>> introducing busy-loops which tend to be bad for power and scalability.
>>
>> After updating the implementation in v2, the performance gains are
>> primarily observed on actual hardware.
>
> ... but that's presumably because the secondary cores are busy-looping.
> That's not something we should do during boot. It might be "fast" on
> your machine but it will probably be "hot" as well.

Hi, Will,

I see your point regarding the 'hot boot' issue, which is indeed a valid
concern for power-constrained devices.

My optimization is tailored for servers and continuously powered single
boards, where boot-up speed is much more critical than temporary power
usage during the early boot phase.

Perhaps we could replace the "yield" instruction with "WFE / SEV"
instructions to coordinate the parallel boot of the primary and
secondary cores. This approach would allow the secondary cores to enter
a low-power standby state rather than busy-looping, effectively
preventing the thermal and power issues on battery-constrained machines.
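
To make the idea concrete, here is a minimal, untested sketch of such a
wait/release pair. It is plain C11 rather than kernel code: boot_release,
cpu_wait_event(), cpu_send_event(), secondary_wait_for_release() and
primary_release_secondaries() are hypothetical names used only for
illustration, not existing kernel symbols. The inline asm is compiled out
on non-arm64 builds so the sketch still builds elsewhere.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical release flag written by the primary CPU. */
static atomic_bool boot_release;

/* Park the calling CPU until an event arrives (arm64 only). */
static inline void cpu_wait_event(void)
{
#if defined(__aarch64__)
	__asm__ volatile("wfe" ::: "memory");
#endif
	/* On other architectures this is a no-op, i.e. a plain poll. */
}

/* Make prior stores visible, then signal an event to all CPUs. */
static inline void cpu_send_event(void)
{
#if defined(__aarch64__)
	__asm__ volatile("dsb ishst\n\tsev" ::: "memory");
#endif
}

/*
 * Secondary side: wait for the release flag instead of spinning on
 * "yield". WFE can return for unrelated events, so the flag is
 * re-checked after every wakeup. Returns the number of wakeups seen
 * before the flag was observed set.
 */
static unsigned long secondary_wait_for_release(void)
{
	unsigned long wakeups = 0;

	while (!atomic_load_explicit(&boot_release, memory_order_acquire)) {
		cpu_wait_event();
		wakeups++;
	}
	return wakeups;
}

/* Primary side: publish the flag, then wake the waiting CPUs. */
static void primary_release_secondaries(void)
{
	atomic_store_explicit(&boot_release, true, memory_order_release);
	cpu_send_event();
}
```

In the kernel itself I would expect to reuse existing helpers rather than
open-code this. As far as I understand, smp_cond_load_acquire() on arm64
already waits with WFE, but I have not yet verified that it fits this
early-boot path.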

>
>>> I implemented this a while ago [1] but didn't manage to see much in terms
>>> of performance improvement and so I didn't bother to send the patches out
>>
>> As shown in v2 below, on actual hardware, this results in a 40%–60%
>> reduction in boot time.
>>
>> Bringup Time Comparison (ms, lower is better):
>>
>> | Platform | Baseline| P=0 | P=1 | Delta(%)|
>> | --------------------- | ------- | ------- | ------ | ------- |
>> | 64-core ATF QEMU | 2075.8 | 2080.7 | 1653.4 | 20.34% |
>> | 192-core server(HIP12)| 14619.2 | 14619.1 | 8589.4 | 41.21% |
>> | 32-core board | 2776.5 | 2881.0 | 1045.0 | 62.36% |
>>
>> Link:
>> https://lore.kernel.org/all/20260618092444.1316336-5-ruanjinjie@xxxxxxxxxx/
>
> To be honest, I'm pretty confused with all these numbers. Your first
> table above suggests that parallel boot is *slower* but then this table
> suggests the opposite. However, it also has a QEMU entry despite being
> "on actual hardware". Is that in a VM?

Sorry for the confusion. The 192-core server (HIP12) and the 32-core
board were tested on real hardware, and both show a 40%–60% reduction in
boot time.

>
>>> after talking about it at KVM forum [2]. However, as mentioned at the end
>>> of that talk, it _is_ still useful for confidential VMs using PSCI so
>>> let me dust off my old series and send it out to see what you think.
>>>
>>> It relies on PSCI v0.2, which means we don't need the NR_CPUS size array
>>> for secondary_data and I also have some support for error handling (it
>>> doesn't look like you handle __early_cpu_boot_status properly).
>>
>> I need some time to look closely at your patch. Alternatively, I will
>> integrate your changes, re-test everything on actual hardware, and then
>> send out a revised version.
>
> Please just give me a week or so to rebase my changes and send them out
> for discussion. It'll be interesting to see what numbers you get.

Sounds good! Take your time, and I'm looking forward to your series.

In the meantime, I have just sent out v3 of this patch. While working
closely with your previous code, I identified a few bugs (including the
multi-CPU status trampling issue we discussed) and addressed them in
this new version. I wanted to share v3 with you now so you can easily
review the fixes and potentially integrate them when you rebase your
changes next week. It also includes the updated performance numbers on
my setup for your reference.

Link:
https://lore.kernel.org/all/20260624092537.2916971-1-ruanjinjie@xxxxxxxxxx/

Looking forward to the discussion!

Best regards,
Jinjie

>
>> It seems that the following patch removing
>> `rcutree_report_cpu_starting()` will reintroduce the original issue that
>> commit ce3d31ad3cac ("arm64/smp: Move rcu_cpu_starting() earlier")
>> solved.
>>
>> Link:
>> https://web.git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/commit/?h=cpu-hotplug&id=bba4b62f45f2614bf6085e6cd3f233528f85bf26
>>
>> Indeed, I also noticed that the invocation order of
>> rcutree_report_cpu_starting() on arm64 is somewhat suboptimal. It
>> hinders the implementation of parallel bringup on arm64 and could
>> potentially lead to RCU stalls.
>>
>> Link:
>> https://lore.kernel.org/all/20260618092444.1316336-4-ruanjinjie@xxxxxxxxxx/
>>
>> [ 0.329017] smp: Bringing up secondary CPUs ...
>> [ 0.343628] Detected VIPT I-cache on CPU1
>> [ 0.343788]
>> [ 0.343806] =============================
>> [ 0.343816] WARNING: suspicious RCU usage
>> [ 0.343966] 7.1.0-rc1-g27c1871848a2 #109 Not tainted
>> [ 0.344087] -----------------------------
>> [ 0.344098] kernel/locking/lockdep.c:3801 RCU-list traversed in non-reader section!!
>
> Thanks, I'll look into this.
>
> Will
>