Re: [PATCH] trace/hwlat: Do not restart per-cpu threads if they are already running

From: Tero Kristo
Date: Thu Mar 02 2023 - 07:02:45 EST


Hi,

On 02/03/2023 13:49, Daniel Bristot de Oliveira wrote:
Hi Tero,

On 3/2/23 08:36, Tero Kristo wrote:
Check if the hwlatd thread for the cpu is already running, before
starting a new one. This avoids running multiple instances of the same
CPU thread on the system. Also, do not wipe the contents of the
per-cpu kthread data when starting the tracer, as this can completely
forget about already running instances and start new additional per-cpu
threads. Fixes issues where fiddling with either the mode of the hwlat
tracer or doing cpu-hotplugs messes up the internal book-keeping
resulting in stale hwlatd threads.

Thanks for your patch.

Would you mind explaining how you hit the problem? That is, how can
I reproduce the same problem you faced?

For example, this script snippet reproduces it for me every time:

#!/bin/sh
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo hwlat > current_tracer
echo per-cpu > hwlat_detector/mode
echo 100000 > hwlat_detector/width
echo 200000 > hwlat_detector/window
echo 200 > tracing_thresh
echo 1 > tracing_on

Another case where the book-keeping goes wrong is when you offline/online a large number of CPUs (which takes a long time) while starting and stopping the hwlat tracer at the same time.

-Tero



I tried reproducing it by dispatching the hwlat tracer in two instances,
but the system already blocks me...

[root@vm tracing]# echo hwlat > current_tracer
[root@vm tracing]# cd instances/
[root@vm instances]# mkdir hwlat_2
[root@vm instances]# cd hwlat_2/
[root@vm hwlat_2]# echo hwlat > current_tracer
-bash: echo: write error: Device or resource busy

[root@vm hwlat_2]# cd ../../
[root@vm tracing]# echo nop > current_tracer
[root@vm tracing]# cd instances/hwlat_2/
[root@vm hwlat_2]# echo hwlat > current_tracer
[root@vm hwlat_2]# cd ..
[root@vm instances]# mkdir hwlat_1
[root@vm instances]# cd hwlat_1/
[root@vm hwlat_1]# echo hwlat > current_tracer
-bash: echo: write error: Device or resource busy
[root@vm hwlat_1]#

Having a reproducer helps us to think better about the problem.

-- Daniel

Signed-off-by: Tero Kristo <tero.kristo@xxxxxxxxxxxxxxx>
---
kernel/trace/trace_hwlat.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c
index d440ddd5fd8b..c4945f8adc11 100644
--- a/kernel/trace/trace_hwlat.c
+++ b/kernel/trace/trace_hwlat.c
@@ -492,6 +492,10 @@ static int start_cpu_kthread(unsigned int cpu)
 {
 	struct task_struct *kthread;
 
+	/* Do not start a new hwlatd thread if it is already running */
+	if (per_cpu(hwlat_per_cpu_data, cpu).kthread)
+		return 0;
+
 	kthread = kthread_run_on_cpu(kthread_fn, NULL, cpu, "hwlatd/%u");
 	if (IS_ERR(kthread)) {
 		pr_err(BANNER "could not start sampling thread\n");
@@ -584,9 +588,6 @@ static int start_per_cpu_kthreads(struct trace_array *tr)
 	 */
 	cpumask_and(current_mask, cpu_online_mask, tr->tracing_cpumask);
 
-	for_each_online_cpu(cpu)
-		per_cpu(hwlat_per_cpu_data, cpu).kthread = NULL;
-
 	for_each_cpu(cpu, current_mask) {
 		retval = start_cpu_kthread(cpu);
 		if (retval)