Re: [PATCH] tracing/osnoise: Fix possible recursive locking for cpus_read_lock()

From: Ran Xiaokai
Date: Mon Mar 17 2025 - 08:53:53 EST


>On Tue, Feb 25, 2025 at 12:31:32PM +0000, Ran Xiaokai wrote:
>> From: Ran Xiaokai <ran.xiaokai@xxxxxxxxxx>
>>
>> Lockdep reports this deadlock log:
>> ============================================
>> WARNING: possible recursive locking detected
>> --------------------------------------------
>> sh/31444 is trying to acquire lock:
>> ffffffff82c51af0 (cpu_hotplug_lock){++++}-{0:0}, at:
>> stop_per_cpu_kthreads+0x7/0x60
>>
>> but task is already holding lock:
>> ffffffff82c51af0 (cpu_hotplug_lock){++++}-{0:0}, at:
>> start_per_cpu_kthreads+0x28/0x140
>>
>> other info that might help us debug this:
>> Possible unsafe locking scenario:
>>
>> CPU0
>> ----
>> lock(cpu_hotplug_lock);
>> lock(cpu_hotplug_lock);
>>
>> Call Trace:
>> <TASK>
>> __lock_acquire+0x1612/0x29b0
>> lock_acquire+0xd0/0x2e0
>> cpus_read_lock+0x49/0x120
>> stop_per_cpu_kthreads+0x7/0x60
>> start_kthread+0x105/0x120
>> start_per_cpu_kthreads+0xdd/0x140
>> osnoise_workload_start+0x261/0x2f0
>> osnoise_tracer_start+0x18/0x4
>>
>> In start_kthread(), when kthread_run_on_cpu() fails,
>> cpus_read_unlock() should be called before stop_per_cpu_kthreads(),
>> but both start_per_cpu_kthreads() and start_kthread() call the error
>> handling routine stop_per_cpu_kthreads(),
>> which is redundant. Only one call is necessary.
>> To fix this, move stop_per_cpu_kthreads() outside of start_kthread(),
>> use the return value of start_kthread() to determine kthread creation
>> error.
>> The same issue exists in osnoise_hotplug_workfn() too.
>>
>> Reviewed-by: Yang Guang <yang.guang5@xxxxxxxxxx>
>> Reviewed-by: Wang Yong <wang.yong12@xxxxxxxxxx>
>> Signed-off-by: Ran Xiaokai <ran.xiaokai@xxxxxxxxxx>
>> ---
>> kernel/trace/trace_osnoise.c | 10 +++++++---
>> 1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
>> index 92e16f03fa4e..38fb0c655f5b 100644
>> --- a/kernel/trace/trace_osnoise.c
>> +++ b/kernel/trace/trace_osnoise.c
>> @@ -2029,7 +2029,6 @@ static int start_kthread(unsigned int cpu)
>>
>> if (IS_ERR(kthread)) {
>> pr_err(BANNER "could not start sampling thread\n");
>> - stop_per_cpu_kthreads();
>> return -ENOMEM;
>> }
>>
>> @@ -2097,7 +2096,7 @@ static void osnoise_hotplug_workfn(struct
>> work_struct *dummy)
>> return;
>>
>> guard(mutex)(&interface_lock);
>> - guard(cpus_read_lock)();
>> + cpus_read_lock();
>>
>> if (!cpu_online(cpu))
>> return;
>> @@ -2105,7 +2104,12 @@ static void osnoise_hotplug_workfn(struct
>> work_struct *dummy)
>> if (!cpumask_test_cpu(cpu, &osnoise_cpumask))
>> return;
>>
>> - start_kthread(cpu);
>> + if (start_kthread(cpu)) {
>> + cpus_read_unlock();
>> + stop_per_cpu_kthreads();
>
>Is it right to call stop_per_cpu_kthreads() which stops osnoise kthread
>for every other CPUs in the system if a failure occurs during hotplug of a
>CPU?

I also suspect that this is not a rational behavior.

>On another note, since stop_per_cpu_kthreads() invokes stop_kthread()
>for every online CPU. It's better to remove stop_per_cpu_kthreads() from
>start_kthread() and handle the error in `osnoise_hotplug_workfn`

Hi, Vishal
I did this in my first versin, something like this:

@@ -2097,7 +2096,7 @@ static void osnoise_hotplug_workfn(struct
work_struct *dummy)
return;

guard(mutex)(&interface_lock);
- guard(cpus_read_lock)();
+ cpus_read_lock();

if (!cpu_online(cpu)) {
+ cpus_read_unlock();
return;
}

if (!cpumask_test_cpu(cpu, &osnoise_cpumask)) {
+ cpus_read_unlock();
return;
}
- start_kthread(cpu);
+ if (start_kthread(cpu)) {
+ cpus_read_unlock();
+ stop_per_cpu_kthreads();
+ return;
+ }
+ cpus_read_unlock();
}

We have to drop the guard() and call unlock() manually,
this somewhat makes the code redundant.

> Vishal
>> + return;
>> + }
>> + cpus_read_unlock();
>> }
>>
>> static DECLARE_WORK(osnoise_hotplug_work, osnoise_hotplug_workfn);
>> --
>> 2.15.2
>>