Re: [PATCH] tracing/osnoise: fix potential deadlock in cpu hotplug
From: Steven Rostedt
Date: Wed Mar 25 2026 - 10:35:54 EST

On Wed, 25 Mar 2026 10:25:42 +0800 (CST)
<hu.shengming@xxxxxxxxxx> wrote:

> >On Tue, 24 Mar 2026 15:06:16 +0800 (CST)
> ><hu.shengming@xxxxxxxxxx> wrote:
> >
> >> From: luohaiyang10243395 <luo.haiyang@xxxxxxxxxx>
> >>
> >> The following sequence may leads deadlock in cpu hotplug:
> >>
> >> CPU0 | CPU1
> >> | schedule_work_on
> >> |
> >> _cpu_down//set CPU1 offline |
> >> cpus_write_lock |
> >> | osnoise_hotplug_workfn
> >> | mutex_lock(&interface_lock);
> >> | cpus_read_lock(); //wait cpu_hotplug_lock
> >> |
> >> | cpuhp/1
> >> | osnoise_cpu_die
> >> | kthread_stop
> >> | wait_for_completion //wait osnoise/1 exit
> >> |
> >> | osnoise/1
> >> | osnoise_sleep
> >> | mutex_lock(&interface_lock); //deadlock
> >>
> >> Fix by swap the order of cpus_read_lock() and mutex_lock(&interface_lock).
> >
> >So the deadlock is due to the "wait_for_completion"?
>
> The osnoise_cpu_init callback returns directly, which may allow another CPU offline task to run,
> the offline task holds the cpu_hotplug_lock while waiting for the osnoise task to exit.
> osnoise_hotplug_workfn may acquire interface_lock first, causing the offline task to be blocked.
> This is an ABBA deadlock.

Right, as I said, it is due to the "wait_for_completion" and not due to two
different locks. One task is waiting for the osnoise task to exit (the
"wait_for_completion") but the osnoise task is blocked on interface_lock.

Better to show it as:

      task1                    task2                    task3
      -----                    -----                    -----

 mutex_lock(&interface_lock)

                          [CPU GOING OFFLINE]

                          cpus_write_lock();
                          osnoise_cpu_die();
                            kthread_stop(task3);
                              wait_for_completion();

                                                   osnoise_sleep();
                                                     mutex_lock(&interface_lock);

 cpus_read_lock();

 [DEAD LOCK]
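
To make the chain above explicit, here is a rough userspace model of it as a
wait-for graph. This is only a sketch and not kernel code: the enum, the
function names and the graph representation are all made up for illustration.
It covers only the three tasks in the diagram, so it says nothing about other
lock orderings in osnoise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three tasks in the diagram. */
enum task { TASK1, TASK2, TASK3, NR_TASKS, NOBODY = -1 };

/*
 * waits_for[t] is the task that must make progress before t can
 * continue, or NOBODY if t is not blocked. A cycle means deadlock.
 */
static bool has_deadlock(const enum task waits_for[NR_TASKS])
{
	for (int start = 0; start < NR_TASKS; start++) {
		enum task t = waits_for[start];

		/* Follow the chain; coming back to 'start' is a cycle. */
		for (int step = 0; step < NR_TASKS && t != NOBODY; step++) {
			assert(t >= 0 && t < NR_TASKS);
			if ((int)t == start)
				return true;
			t = waits_for[t];
		}
	}
	return false;
}

/* Current order: task1 holds interface_lock, then blocks in cpus_read_lock(). */
static bool current_order_deadlocks(void)
{
	const enum task waits_for[NR_TASKS] = {
		[TASK1] = TASK2,	/* cpus_read_lock() waits for the writer */
		[TASK2] = TASK3,	/* wait_for_completion() waits for task3 to exit */
		[TASK3] = TASK1,	/* mutex_lock(&interface_lock) waits for task1 */
	};

	return has_deadlock(waits_for);
}

/* Swapped order: task1 blocks in cpus_read_lock() before taking interface_lock. */
static bool swapped_order_deadlocks(void)
{
	const enum task waits_for[NR_TASKS] = {
		[TASK1] = TASK2,	/* still waits for the writer */
		[TASK2] = TASK3,	/* still waits for task3 to exit */
		[TASK3] = NOBODY,	/* interface_lock is free, task3 can exit */
	};

	return has_deadlock(waits_for);
}
```

In this model the current ordering forms the cycle task1 -> task2 -> task3 ->
task1, and taking cpus_read_lock() before interface_lock breaks it because
task1 no longer holds the mutex while it waits.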
>
> >How did you find this bug? Inspection, AI, triggered?
> >
> >Thanks,
> >
> >-- Steve
>
> We run autotests on kernel-6.6, report following hung task warning, and we think the same issue exists
> in linux-stable.

Thanks. It's usually good to state how a bug was discovered when fixing it.
Could you send a v2 with an updated change log?

-- Steve