[Question] Call trace occurs occasionally when a rollback is performed upon CPU online timeout

From: Kunkun Jiang
Date: Wed Jan 15 2025 - 07:33:01 EST


Hi all,

I have a question about CPU online/offline. In the following test scenario, various tasks (iperf, fio, sve, ...) are executed in a VM with 6 vCPUs. At the same time, online/offline operations are repeated on two of the vCPUs through /sys/devices/system/cpu/cpuX/online. After running for many hours, some call traces appear in the guest.
In the first, WARN_ON_ONCE(test_bit(KTHREAD_SHOULD_PARK, &kthread->flags)) is triggered.
Call trace:
kthread_park+0xd0/0xdc
takedown_cpu+0x4c/0x140
cpuhp_invoke_callback+0x160/0x6e0
_cpu_up+0x1a4/0x200
cpu_up+0xbc/0x100
cpu_device_up+0x20/0x30
cpu_subsys_online+0x4c/0xb0
device_online+0x7c/0xa0
online_store+0xd0/0xe0
dev_attr_store+0x20/0x34
sysfs_kf_write+0x4c/0x5c
kernfs_fop_write_iter+0x130/0x1c0
new_sync_write+0xec/0x18c
vfs_write+0x214/0x2ac
ksys_write+0x70/0xfc
__arm64_sys_write+0x24/0x30
invoke_syscall+0x50/0x11c
el0_svc_common.constprop.0+0x68/0x164
do_el0_svc+0x34/0xcc
el0_svc+0x20/0x30
el0_sync_handler+0xb8/0xc0
el0_sync+0x160/0x180

In the second, BUG_ON(!irqs_disabled() && !IS_ENABLED(CONFIG_PREEMPT_RT)) is triggered.
Call trace:
irq_work_run_list+0x64/0x70
smpcfd_dying_cpu+0x24/0x34
cpuhp_invoke_callback+0x160/0x6e0
_cpu_up+0x1a4/0x200
cpu_up+0xbc/0x100
cpu_device_up+0x20/0x30
cpu_subsys_online+0x4c/0xb0
device_online+0x7c/0xa0
online_store+0xd0/0xe0
dev_attr_store+0x20/0x34
sysfs_kf_write+0x4c/0x5c
kernfs_fop_write_iter+0x130/0x1c0
new_sync_write+0xec/0x18c
vfs_write+0x214/0x2ac
ksys_write+0x70/0xfc
__arm64_sys_write+0x24/0x30
invoke_syscall+0x50/0x11c
el0_svc_common.constprop.0+0x68/0x164
do_el0_svc+0x34/0xcc
el0_svc+0x20/0x30
el0_sync_handler+0xb8/0xc0
el0_sync+0x160/0x180
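
For reference, the hotplug part of the test scenario can be approximated by a loop like the one below. This is only a sketch: the function names, delay and iteration count are placeholders chosen for illustration, not the values used in the actual test, and the workload tasks are not included.

```python
import time
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")  # standard sysfs location


def set_online(cpu: int, online: bool, base: Path = SYSFS_CPU) -> None:
    """Write 1 or 0 to cpuX/online; raises OSError if the write is rejected."""
    (base / f"cpu{cpu}" / "online").write_text("1" if online else "0")


def toggle_cpus(cpus, iterations, delay=0.1, base: Path = SYSFS_CPU):
    """Offline and then online each CPU repeatedly; return the failures seen."""
    failures = []
    for i in range(iterations):
        for cpu in cpus:
            for online in (False, True):
                try:
                    set_online(cpu, online, base)
                except OSError as exc:
                    # Record the failed write instead of stopping the run.
                    failures.append((i, cpu, online, exc.errno))
                time.sleep(delay)
    return failures
```

Calling something like toggle_cpus([4, 5], iterations=1000) as root inside the guest (the CPU numbers are placeholders) repeats the offline/online cycle and collects any write that fails, for example an online attempt that times out.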

According to my analysis, the root cause is that the vCPU online operation times out even though the vCPU has in fact come online successfully. Because of the timeout, a rollback is performed. During the rollback, secondary_start_kernel is still executing, which results in the call traces above. So is this a bug? If so, how should it be fixed?
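
To make the suspected ordering concrete, here is a purely user-space analogy in Python. It is not kernel code. As far as I can tell, on arm64 __cpu_up() waits with wait_for_completion_timeout() for a completion that secondary_start_kernel() signals; the Event below merely stands in for that completion, and the function name and delay values are made up for illustration.

```python
import threading
import time


def bring_up(boot_delay: float, timeout: float) -> list:
    """Return the ordered event log of one simulated bring-up attempt."""
    log = []
    running = threading.Event()  # stands in for the "CPU is running" completion

    def secondary():
        time.sleep(boot_delay)   # secondary is still in its startup path
        log.append("secondary: online")
        running.set()

    t = threading.Thread(target=secondary)
    t.start()
    if running.wait(timeout):
        log.append("boot: success")
    else:
        # The waiter gives up and starts undoing state while the
        # secondary thread is still running toward "online".
        log.append("boot: timeout, rolling back")
    t.join()
    return log
```

With a boot_delay larger than the timeout, for example bring_up(0.3, 0.05), the log shows the rollback starting before the secondary reports that it is online, which is the ordering I suspect in the traces above.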

I have not yet found the reason for the timeout. I suspect it is caused by excessive load from the running tasks. If you have other ideas, please let me know.

Thanks,
Kunkun Jiang