Interference of CPU hotplug on CPU isolation and Real-Time tasks

From: Costa Shulyupin
Date: Mon Dec 09 2024 - 02:11:22 EST


Hello

Simplified test:
rtla timerlat hist -c 1 -a 500 &
echo 0 > /sys/devices/system/cpu/cpu11/online
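
For repeated runs, the hotplug step above can be wrapped in a small helper that brings the CPU back online afterwards. This is only a minimal sketch, assuming POSIX sh and root privileges. The function names and the SYSFS_CPU_ROOT variable are hypothetical; only the sysfs path and the rtla command line come from the test above.

```shell
#!/bin/sh
# Hypothetical helpers for repeating the hotplug test.
# Only the sysfs layout (/sys/devices/system/cpu/cpuN/online) is taken
# from the report; all names below are made up for illustration.

# Root of the CPU sysfs tree; overridable so the helpers can be tried
# against a scratch directory instead of the real sysfs.
: "${SYSFS_CPU_ROOT:=/sys/devices/system/cpu}"

# Print the path of the "online" control file for CPU $1.
cpu_online_path() {
    printf '%s/cpu%s/online\n' "$SYSFS_CPU_ROOT" "$1"
}

# Write $2 (0 = offline, 1 = online) to the control file of CPU $1.
set_cpu_online() {
    path=$(cpu_online_path "$1")
    [ -w "$path" ] || { echo "cannot write $path" >&2; return 1; }
    echo "$2" > "$path"
}

# Take CPU $1 offline, then bring it back so the test can be repeated.
hotplug_cycle() {
    set_cpu_online "$1" 0 && set_cpu_online "$1" 1
}
```

With these defined, start rtla as above (rtla timerlat hist -c 1 -a 500 &) and then run hotplug_cycle 11 as root.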

RTLA reports the stack trace of the blocking thread:
...
-> multi_cpu_stop
-> cpu_stopper_thread
-> smpboot_thread_fn
...

I've found that multi_cpu_stop() disables interrupts on EVERY online
CPU, because takedown_cpu() invokes take_cpu_down() indirectly through
stop_machine_cpuslocked(). This is why taking CPU 11 offline blocks the
timerlat thread on CPU 1. I'm omitting the detailed description of the
call chain.

Using stop_one_cpu() instead of stop_machine_cpuslocked() could
potentially solve the problem:

@@ -1335,7 +1339,7 @@ static int takedown_cpu(unsigned int cpu)
 	/*
 	 * So now all preempt/rcu users must observe !cpu_active().
 	 */
-	err = stop_machine_cpuslocked(take_cpu_down, NULL, cpumask_of(cpu));
+	err = stop_one_cpu(cpu, take_cpu_down, NULL);

The original stop_machine code was introduced 20 years ago:
Author: rusty <rusty>
Date: Fri Mar 19 16:02:28 2004 +0000

[PATCH] Hotplug CPUs: cpu_down()

Implement cpu_down(): uses stop_machine to freeze the machine, then
uses (arch-specific) __cpu_disable() and migrate_all_tasks().

Whole thing under CONFIG_HOTPLUG_CPU, so doesn't break archs which
don't define that.

https://github.com/jeffmahoney/linux-pre-git/commit/864a81b15223552102124656a012ac6de6947499#diff-52e4b09f63a029f319f95a60ddc0a09c31de0e172f8a2802ce39294569e60587R122

Additionally, take_cpu_down() relies on local_irq_save() and
hard_irq_disable(). However, I am omitting that part of the patch here
to concentrate solely on stop_one_cpu().

Questions:
1. Why is stop_machine() used during CPU hotplug?
2. Is it worth testing stop_one_cpu() instead, or would that be the
wrong approach?
3. Do you have any additional recommendations?

Thanks
Costa