arm64: BUG report: hugepages cause secondary CPUs to fail to come online.

From: Chengdong Li (李成栋)
Date: Wed Sep 16 2020 - 14:25:06 EST


Hello folks,

We observed that an Arm VM instance (CentOS 8) runs into CPU bring-up timeouts at boot time when it is launched by QEMU/KVM on an Arm server host with hugepages (2MB or 1GB) enabled.
We also noticed that Will has a workaround that applies to the guest kernel (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0e1645557d19fc6d88d3c40431f63a3c3a4c417b).
However, it simply increases the timeout from 1s to 5s, and it has not been accepted upstream so far. Moreover, even if it were accepted upstream, it would take a long time for distributions like CentOS to backport it.
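
For context, the timeout being hit is the completion wait in the arm64 secondary CPU bring-up path. Below is a heavily trimmed sketch of __cpu_up() in arch/arm64/kernel/smp.c from kernels of that period, not a verbatim copy; the real function also records a boot status word, which is where the "failed in unknown state" lines further down come from:

/* Heavily trimmed sketch of arch/arm64/kernel/smp.c (circa v5.x). */
int __cpu_up(unsigned int cpu, struct task_struct *idle)
{
	int ret = boot_secondary(cpu, idle);	/* kick the secondary CPU */

	if (ret)
		return ret;

	/*
	 * Wait for the secondary CPU to come online or time out.
	 * This 1000ms budget is what Will's patch raises to 5000ms.
	 */
	wait_for_completion_timeout(&cpu_running, msecs_to_jiffies(1000));

	if (cpu_online(cpu))
		return 0;

	pr_crit("CPU%u: failed to come online\n", cpu);
	return -EIO;
}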

We are wondering whether we have a chance to fix this problem on the host side instead: most cloud providers can upgrade their host software much faster than distributions ship kernel updates, so eventually there would be no need to upgrade the guest OS at all.
We have been investigating the root cause for a few weeks, and our tentative understanding is that the problem may be related to page-lock contention on the host while the guest kernel allocates per-CPU memory for each CPU at boot time.
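
If our reading of the perf stacks below is correct, the contended lock may be the per-VM kvm->mmu_lock: the host-side madvise()/zap path fires KVM's MMU notifiers, which serialize on that spinlock, while the vCPU threads populating guest memory take the same lock on every stage-2 fault. This is only our assumption, not a confirmed diagnosis; a heavily trimmed sketch of the notifier in virt/kvm/kvm_main.c from kernels of that period:

/* Heavily trimmed sketch of virt/kvm/kvm_main.c (circa v5.x). */
static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
					const struct mmu_notifier_range *range)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);

	spin_lock(&kvm->mmu_lock);	/* one lock for the whole VM */
	kvm->mmu_notifier_count++;
	/* ... unmap the affected stage-2 range, flush TLBs ... */
	spin_unlock(&kvm->mmu_lock);
}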

However, we are not certain about the root cause and hope you could share your expertise on this. We would really appreciate any insights!


Steps to reproduce:
=============
- Enable 1GB hugepages on the host kernel side (see the example after the QEMU command below).
- Start the VM with the following command (QEMU must support passing the CPU topology through ACPI and DT):
# qemu-system-aarch64 -name guest=12345,debug-threads=on \
-machine virt,accel=kvm,usb=off,dump-guest-core=off,gic-version=3 \
-cpu host \
-m 122880 \
-object iothread,id=iothread1 \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages,share=yes,size=64424509440,host-nodes=0,policy=preferred \
-numa node,nodeid=0,cpus=0-29,memdev=ram-node0 \
-object memory-backend-file,id=ram-node1,prealloc=yes,mem-path=/dev/hugepages,share=yes,size=64424509440,host-nodes=1,policy=preferred \
-numa node,nodeid=1,cpus=30-59,memdev=ram-node1 \
-smp 60,sockets=2,cores=30,threads=1 \
-bios /usr/share/AAVMF/AAVMF_CODE.fd \
-drive file=centos8.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none \
-device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-net none -serial telnet::9001,server,nowait -monitor stdio
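
One possible way to provision the host-side 1GB hugepages for the first step, assuming the host CPU supports 1GB pages (the page count here is illustrative; the command line above needs 120GB of backing across the two NUMA nodes): boot the host with

default_hugepagesz=1G hugepagesz=1G hugepages=128

on the kernel command line, then make sure hugetlbfs is mounted with the matching page size:

# mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages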

Guest kernel boot timeout log sample:
===============================

[ 0.017636] smp: Bringing up secondary CPUs ...
[ 1.039073] CPU1: failed to come online
[ 1.039486] CPU1: failed in unknown state : 0x0
[ 1.052547] Detected VIPT I-cache on CPU2
[ 1.052563] GICv3: CPU2: found redistributor 2 region 0:0x00000000080e0000
[ 1.052708] GICv3: CPU2: using allocated LPI pending table @0x0000000f006a0000
[ 1.052729] CPU2: Booted secondary processor 0x0000000001 [0x481fd010]
[ 2.079846] CPU3: failed to come online
[ 2.082698] CPU3: failed in unknown state : 0x0
[ 2.086925] Detected VIPT I-cache on CPU4
[ 2.086947] GICv3: CPU4: found redistributor 4 region 0:0x0000000008120000
[ 2.087232] GICv3: CPU4: using allocated LPI pending table @0x0000000f006c0000
[ 2.087259] CPU4: Booted secondary processor 0x0000000002 [0x481fd010]
[ 3.120634] CPU5: failed to come online
[ 3.121691] Detected VIPT I-cache on CPU5
[ 3.121754] CPU5: failed in unknown state : 0x0
[ 3.121774] GICv3: CPU5: found redistributor 5 region 0:0x0000000008140000


Linux perf output (host side) while the guest kernel is booting:
==============================================

- 19.75% worker [kernel.kallsyms] [k] queued_spin_lock_slowpath
- queued_spin_lock_slowpath
- 12.98% kvm_mmu_notifier_invalidate_range_end
__mmu_notifier_invalidate_range_end
zap_page_range
__arm64_sys_madvise
el0_svc_common.constprop.3
el0_svc_handler
el0_svc
__madvise
tcmalloc::PageHeap::DecommitSpan
tcmalloc::PageHeap::MergeIntoFreeList
tcmalloc::PageHeap::Delete
tcmalloc::CentralFreeList::ReleaseToSpans
tcmalloc::CentralFreeList::ReleaseListToSpans
tcmalloc::CentralFreeList::InsertRange
tcmalloc::ThreadCache::ReleaseToCentralCache
tcmalloc::ThreadCache::Cleanup
tcmalloc::ThreadCache::DeleteCache
__nptl_deallocate_tsd
start_thread
thread_start
- 6.77% kvm_mmu_notifier_invalidate_range_start
__mmu_notifier_invalidate_range_start
zap_page_range
__arm64_sys_madvise
el0_svc_common.constprop.3
el0_svc_handler
el0_svc
__madvise
tcmalloc::PageHeap::DecommitSpan
tcmalloc::PageHeap::MergeIntoFreeList
tcmalloc::PageHeap::Delete
tcmalloc::CentralFreeList::ReleaseToSpans
tcmalloc::CentralFreeList::ReleaseListToSpans
tcmalloc::CentralFreeList::InsertRange
tcmalloc::ThreadCache::ReleaseToCentralCache
tcmalloc::ThreadCache::Cleanup
tcmalloc::ThreadCache::DeleteCache
__nptl_deallocate_tsd
start_thread
thread_start
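
For reference, a system-wide profile like the one above can be captured on the host with stock perf (the 30s duration is arbitrary):

# perf record -a -g -- sleep 30
# perf report --stdio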


Thanks,
Chengdong