Re: [PATCH RFC 3/3] arm64: Add HOTPLUG_PARALLEL support for secondary CPUs
From: Will Deacon
Date: Tue Jun 23 2026 - 10:31:22 EST
On Mon, Jun 22, 2026 at 04:06:38PM +0800, Jinjie Ruan wrote:
> On 6/18/2026 8:21 PM, Will Deacon wrote:
> > On Mon, Jun 15, 2026 at 04:51:48PM +0800, Jinjie Ruan wrote:
> >> On 6/12/2026 11:45 PM, Michael Kelley wrote:
> >>> From: Jinjie Ruan <ruanjinjie@xxxxxxxxxx> Sent: Thursday, June 11, 2026 6:38 AM
> >>>>
> >>>> Support for parallel secondary CPU bringup is already utilized by x86,
> >>>> MIPS, and RISC-V. This patch brings this capability to the arm64
> >>>> architecture.
> >>>>
> >>>> Rework the global `secondary_data` accessed during early boot into
> >>>> a per-CPU array. This array maps logical CPU IDs to MPIDR_EL1 values,
> >>>> enabling the early boot code in head.S to resolve each secondary CPU's
> >>>> logical ID concurrently.
> >>>>
> >>>> To fully enable HOTPLUG_PARALLEL, this patch implements:
> >>>> 1) An arm64-specific arch_cpuhp_kick_ap_alive() handler.
> >>>> 2) Callbacks to cpuhp_ap_sync_alive() inside secondary_start_kernel().
> >>>>
> >>>> Successfully tested on QEMU ARM64 virt machine (KVM on, 128 vCPUs).
> >>>>
> >>>> | test kernel        | secondary CPUs boot time |
> >>>> | ------------------ | ------------------------ |
> >>>> | Without this patch | 155.672                  |
> >>>> | cpuhp.parallel=0   | 62.897                   |
> >>>> | cpuhp.parallel=1   | 166.703                  |
> >>>
> >>> The last two rows seem mixed up. I would expect parallel=0 to
> >>> result in a longer boot time.
> >>
> >> The results are correct and not mixed up.
> >>
> >> Compared to the original non‑HOTPLUG_PARALLEL approach, the advantage of
> >> cpuhp.parallel=0 lies in its use of cpu_relax() (`yield` on arm64) instead
> >> of the wait_for_completion_timeout() mechanism (which may cause sleep
> >> and context switching). This significantly reduces the overhead of VM
> >> exits and context switches in a KVM guest, thereby cutting the secondary
> >> CPU boot time by more than half.
> >
> > I don't think that's a particularly compelling reason to enable this for
> > arm64, in all honesty. The yield instruction typically doesn't do
> > anything on actual arm64 silicon, so this probably means that you're
> > introducing busy-loops which tend to be bad for power and scalability.
>
> After updating the implementation in v2, the performance gains are
> primarily observed on actual hardware.
... but that's presumably because the secondary cores are busy-looping.
That's not something we should do during boot. It might be "fast" on
your machine but it will probably be "hot" as well.
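
To make the trade-off concrete, here is a rough userspace sketch of the
two waiting styles being compared in this thread. It is not kernel code
and not taken from either series: `struct bringup`, `relax_hint()`,
`wait_spinning()`, `wait_sleeping()` and `bringup_one()` are hypothetical
names, a pthread stands in for the secondary CPU, and the inline asm
assumes GCC or Clang.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the "CPU is alive" handshake; not kernel code. */
struct bringup {
	atomic_bool alive;        /* polled by the spinning waiter */
	bool alive_sleepers;      /* protected by lock, paired with cond */
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

/* Spin hint: "yield" on arm64, a plain compiler barrier elsewhere. */
static inline void relax_hint(void)
{
#if defined(__aarch64__)
	__asm__ __volatile__("yield" ::: "memory");
#else
	__asm__ __volatile__("" ::: "memory");
#endif
}

/* Plays the secondary CPU: announce liveness to both kinds of waiter. */
static void *secondary(void *arg)
{
	struct bringup *b = arg;

	atomic_store_explicit(&b->alive, true, memory_order_release);
	pthread_mutex_lock(&b->lock);
	b->alive_sleepers = true;
	pthread_cond_signal(&b->cond);
	pthread_mutex_unlock(&b->lock);
	return NULL;
}

/* Busy-wait: stays on the CPU and keeps polling until the flag flips. */
static unsigned long wait_spinning(struct bringup *b)
{
	unsigned long polls = 0;

	while (!atomic_load_explicit(&b->alive, memory_order_acquire)) {
		relax_hint();
		polls++;
	}
	return polls;
}

/* Blocking wait: the caller sleeps until the secondary signals it. */
static void wait_sleeping(struct bringup *b)
{
	pthread_mutex_lock(&b->lock);
	while (!b->alive_sleepers)
		pthread_cond_wait(&b->cond, &b->lock);
	pthread_mutex_unlock(&b->lock);
}

/* Bring up one fake secondary; returns 0 on success, -1 on failure. */
static int bringup_one(bool spin)
{
	struct bringup b = {
		.alive = false,
		.alive_sleepers = false,
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.cond = PTHREAD_COND_INITIALIZER,
	};
	pthread_t t;

	if (pthread_create(&t, NULL, secondary, &b))
		return -1;
	if (spin)
		wait_spinning(&b);
	else
		wait_sleeping(&b);
	return pthread_join(t, NULL) ? -1 : 0;
}
```

Calling `bringup_one(true)` takes the polling path and `bringup_one(false)`
takes the sleeping path. Both complete the handshake; the difference is
that the polling waiter occupies its CPU for the whole wait, which is the
power and scalability concern above.
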
> > I implemented this a while ago [1] but didn't manage to see much in terms
> > of performance improvement and so I didn't bother to send the patches out
>
> As shown in v2 below, on actual hardware, this results in a 40%–60%
> reduction in boot time.
>
> Bringup Time Comparison (ms, lower is better):
>
> | Platform | Baseline| P=0 | P=1 | Delta(%)|
> | --------------------- | ------- | ------- | ------ | ------- |
> | 64-core ATF QEMU | 2075.8 | 2080.7 | 1653.4 | 20.34% |
> | 192-core server(HIP12)| 14619.2 | 14619.1 | 8589.4 | 41.21% |
> | 32-core board | 2776.5 | 2881.0 | 1045.0 | 62.36% |
>
> Link:
> https://lore.kernel.org/all/20260618092444.1316336-5-ruanjinjie@xxxxxxxxxx/
To be honest, I'm pretty confused by all these numbers. Your first
table above suggests that parallel boot is *slower* but then this table
suggests the opposite. However, it also has a QEMU entry despite being
"on actual hardware". Is that in a VM?
> > after talking about it at KVM forum [2]. However, as mentioned at the end
> > of that talk, it _is_ still useful for confidential VMs using PSCI so
> > let me dust off my old series and send it out to see what you think.
> >
> > It relies on PSCI v0.2, which means we don't need the NR_CPUS size array
> > for secondary_data and I also have some support for error handling (it
> > doesn't look like you handle __early_cpu_boot_status properly).
>
> I need some time to look closely at your patch. Alternatively, I will
> integrate your changes, re-test everything on actual hardware, and then
> send out a revised version.
Please just give me a week or so to rebase my changes and send them out
for discussion. It'll be interesting to see what numbers you get.
> It seems that the following patch, which removes
> `rcutree_report_cpu_starting()`, will reintroduce the issue that
> commit ce3d31ad3cac ("arm64/smp: Move rcu_cpu_starting() earlier")
> solved.
>
> Link:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/commit/?h=cpu-hotplug&id=bba4b62f45f2614bf6085e6cd3f233528f85bf26
>
> Indeed, I also noticed that the invocation order of
> rcutree_report_cpu_starting() on arm64 is somewhat suboptimal. It
> hinders the implementation of parallel bringup on arm64 and could
> potentially lead to RCU stalls.
>
> Link:
> https://lore.kernel.org/all/20260618092444.1316336-4-ruanjinjie@xxxxxxxxxx/
>
> [ 0.329017] smp: Bringing up secondary CPUs ...
> [ 0.343628] Detected VIPT I-cache on CPU1
> [ 0.343788]
> [ 0.343806] =============================
> [ 0.343816] WARNING: suspicious RCU usage
> [ 0.343966] 7.1.0-rc1-g27c1871848a2 #109 Not tainted
> [ 0.344087] -----------------------------
> [ 0.344098] kernel/locking/lockdep.c:3801 RCU-list traversed in
> non-reader section!!
Thanks, I'll look into this.
Will