On Tue, Feb 07, 2023 at 11:04:27PM +0000, Usama Arif wrote:
> Tested on v7, doing INIT/SIPI/SIPI in parallel brings down the time for
> smpboot from ~700ms to 100ms (85% improvement) on a server with 128 CPUs
> split across 2 NUMA nodes.
> The major change over v6 is keeping parallel SMP support enabled on AMD.
> APIC ID for parallel CPU bringup is now obtained from CPUID leaf 0x0B
> (for x2APIC mode), otherwise from CPUID leaf 0x1 (8 bits).
> The patch for reusing timer calibration for secondary CPUs is also removed
> from the series, as it's not part of parallel SMP bringup and needs
> further thought.
Running rcutorture on this series got me the following NULL pointer
dereference in scenario TREE01:
------------------------------------------------------------------------
[ 34.662066] smpboot: CPU 0 is now offline
[ 34.674075] rcu: NOCB: Cannot CB-offload offline CPU 25
[ 35.038003] rcu: De-offloading 5
[ 35.112997] rcu: Offloading 12
[ 35.716011] smpboot: Booting Node 0 Processor 0 APIC 0x0
[ 35.762685] BUG: kernel NULL pointer dereference, address: 0000000000000001
[ 35.764278] #PF: supervisor instruction fetch in kernel mode
[ 35.765530] #PF: error_code(0x0010) - not-present page
[ 35.766700] PGD 0 P4D 0
[ 35.767278] Oops: 0010 [#1] PREEMPT SMP PTI
[ 35.768223] CPU: 36 PID: 0 Comm: swapper/36 Not tainted 6.2.0-rc1-00206-g18a37610b632-dirty #3563
[ 35.770201] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
------------------------------------------------------------------------
Given an x86 system with KVM and QEMU, this can be reproduced by running
the following from the top-level directory of the Linux-kernel source
tree:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --configs "TREE01 TINY01" --trust-make
Out of 15 runs, 14 blew up just after the first attempt to bring CPU
0 back online. The 15th run blew up just after the second attempt to
bring CPU 0 online, the first attempt having succeeded.
My guess is that the CONFIG_BOOTPARAM_HOTPLUG_CPU0=y Kconfig option is
tickling this bug. This Kconfig option has been added to the TREE01
scenario in the -rcu tree's "dev" branch, which might mean that this test
would pass on mainline. But CONFIG_BOOTPARAM_HOTPLUG_CPU0=y is not new,
only rcutorture's testing of it.
Thoughts?
Thanx, Paul