[GIT PULL] tracing: Fixes for 7.0

From: Steven Rostedt

Date: Thu Mar 05 2026 - 10:47:21 EST



Linus,

tracing fixes for v7.0:

- Fix thresh_return of function graph tracer

The update to store data on the shadow stack removed the abuse of
using the task recursion word as a way to keep track of which functions
to ignore. The trace_graph_return() function was updated to handle this,
but when the function_graph tracer uses a threshold (only tracing
functions that took longer than a specified time), it uses
trace_graph_thresh_return() instead. That function was still incorrectly
using the task struct recursion word, causing the function graph tracer
to permanently set all functions to "notrace".

- Fix thresh_return nosleep accounting

When the calltime was moved to the shadow stack storage instead of
being on the fgraph descriptor, the calculation of the amount of sleep
time was updated. The calculation was done in the
trace_graph_thresh_return() function, which also called
trace_graph_return(), which did the calculation again, causing the
sleep time to be accounted twice.

Remove the call to trace_graph_return(), as what it needed to do was
minimal, and do that work directly in trace_graph_thresh_return().

- Fix syscall trace event activation on boot up

The syscall trace events are pseudo events attached to the raw_syscall
tracepoints. When the first syscall event is enabled, it enables the
raw_syscall tracepoint and doesn't need to do anything when a second
syscall event is also enabled.

When events are enabled via the kernel command line, syscall events
are only partially enabled, as the enabling is called before rcu_init().
This is done to allow early events to be enabled immediately. Because
kernel command line events do not distinguish between different types
of events, the syscall events are enabled here but are not fully
functional. After rcu_init(), they are disabled and re-enabled so that
they can be fully enabled. The problem is that this "disable-enable" is
done one event at a time. If more than one syscall event is specified
on the command line, disabling them one at a time means the refcount
never reaches zero, so the raw_syscall tracepoint is never disabled and
re-enabled, keeping the syscall events in their non-fully-functional
state.

Instead, disable all events and then re-enable them all, as that
ensures the raw_syscall tracepoint is also disabled and re-enabled.

- Disable preemption in ftrace pid filtering

The ftrace pid filtering attaches to the fork and exit tracepoints to
add or remove pids that should be traced. The callbacks access
variables protected by sched RCU, which requires preemption to be
disabled. Now that tracepoint callbacks are called with preemption
enabled, this protection needs to be added explicitly rather than
depending on the callbacks being called with preemption disabled.

- Disable preemption in event pid filtering

The event pid filtering needs the same preemption disabling guards as
ftrace pid filtering.

- Fix accounting of the memory mapped ring buffer on fork

Memory mapping the ftrace ring buffer sets the vm_flags to VM_DONTCOPY.
But this does not prevent the application from calling
madvise(MADV_DOFORK), which causes the mapping to be copied on fork.
After the first task exits, the mapping is considered unmapped by
everyone. When the second task exits, the counter goes below zero and
triggers a WARN_ON.

Since nothing prevents two separate tasks from mmapping the ftrace ring
buffer (although two mappings may mess each other up), there's no reason
to stop the memory from being copied on fork.

Update the vm_operations to have an ".open" handler to update the
accounting and let the ring buffer know someone else has it mapped.
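
The accounting problem can be sketched with a toy counter (hypothetical
names; the real code lives in ring_buffer.c and trace.c below):

```c
/* Toy model of the user_mapped accounting: .open runs when a VMA is
 * duplicated (e.g. on fork()), .close when a VMA is unmapped. Without
 * an .open handler, a forked copy of the mapping produces an extra
 * .close with no matching increment, underflowing the counter. */
static int user_mapped;

static void buffer_vma_open(void)  { user_mapped++; }
static void buffer_vma_close(void) { user_mapped--; }
```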

- Add all ftrace headers in MAINTAINERS file

The MAINTAINERS file only specifies include/linux/ftrace.h, but misses
ftrace_irq.h and ftrace_regs.h. Make the file use wildcards to match
all *ftrace* files.


Please pull the latest trace-v7.0-rc2 tree, which can be found at:


git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
trace-v7.0-rc2

Tag SHA1: eb510ef50609f82ad56503c2fddb8e40b9b8ad3c
Head SHA1: f26b098d937488e8f5c617d465760a10bfcc7f13


Huiwen He (1):
tracing: Fix syscall events activation by ensuring refcount hits zero

Jerome Marchand (1):
ftrace: Add MAINTAINERS entries for all ftrace headers

Masami Hiramatsu (Google) (1):
tracing: Disable preemption in the tracepoint callbacks handling filtered pids

Qing Wang (1):
tracing: Fix WARN_ON in tracing_buffers_mmap_close

Shengming Hu (2):
fgraph: Fix thresh_return clear per-task notrace
fgraph: Fix thresh_return nosleeptime double-adjust

Steven Rostedt (1):
ftrace: Disable preemption in the tracepoint callbacks handling filtered pids

----
MAINTAINERS | 2 +-
include/linux/ring_buffer.h | 1 +
kernel/trace/ftrace.c | 2 ++
kernel/trace/ring_buffer.c | 21 ++++++++++++++
kernel/trace/trace.c | 13 +++++++++
kernel/trace/trace_events.c | 54 ++++++++++++++++++++++++++----------
kernel/trace/trace_functions_graph.c | 19 +++++++++----
7 files changed, 90 insertions(+), 22 deletions(-)
---------------------------
diff --git a/MAINTAINERS b/MAINTAINERS
index 61bf550fd37c..b8d1ad952827 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10484,7 +10484,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
F: Documentation/trace/ftrace*
F: arch/*/*/*/*ftrace*
F: arch/*/*/*ftrace*
-F: include/*/ftrace.h
+F: include/*/*ftrace*
F: kernel/trace/fgraph.c
F: kernel/trace/ftrace*
F: samples/ftrace
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 876358cfe1b1..d862fa610270 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -248,6 +248,7 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);

int ring_buffer_map(struct trace_buffer *buffer, int cpu,
struct vm_area_struct *vma);
+void ring_buffer_map_dup(struct trace_buffer *buffer, int cpu);
int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
#endif /* _LINUX_RING_BUFFER_H */
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 827fb9a0bf0d..2f72af0357e5 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -8611,6 +8611,7 @@ ftrace_pid_follow_sched_process_fork(void *data,
struct trace_pid_list *pid_list;
struct trace_array *tr = data;

+ guard(preempt)();
pid_list = rcu_dereference_sched(tr->function_pids);
trace_filter_add_remove_task(pid_list, self, task);

@@ -8624,6 +8625,7 @@ ftrace_pid_follow_sched_process_exit(void *data, struct task_struct *task)
struct trace_pid_list *pid_list;
struct trace_array *tr = data;

+ guard(preempt)();
pid_list = rcu_dereference_sched(tr->function_pids);
trace_filter_add_remove_task(pid_list, NULL, task);

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f16f053ef77d..17d0ea0cc3e6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -7310,6 +7310,27 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu,
return err;
}

+/*
+ * This is called when a VMA is duplicated (e.g., on fork()) to increment
+ * the user_mapped counter without remapping pages.
+ */
+void ring_buffer_map_dup(struct trace_buffer *buffer, int cpu)
+{
+ struct ring_buffer_per_cpu *cpu_buffer;
+
+ if (WARN_ON(!cpumask_test_cpu(cpu, buffer->cpumask)))
+ return;
+
+ cpu_buffer = buffer->buffers[cpu];
+
+ guard(mutex)(&cpu_buffer->mapping_lock);
+
+ if (cpu_buffer->user_mapped)
+ __rb_inc_dec_mapped(cpu_buffer, true);
+ else
+ WARN(1, "Unexpected buffer stat, it should be mapped");
+}
+
int ring_buffer_unmap(struct trace_buffer *buffer, int cpu)
{
struct ring_buffer_per_cpu *cpu_buffer;
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 23de3719f495..1e7c032a72d2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8213,6 +8213,18 @@ static inline int get_snapshot_map(struct trace_array *tr) { return 0; }
static inline void put_snapshot_map(struct trace_array *tr) { }
#endif

+/*
+ * This is called when a VMA is duplicated (e.g., on fork()) to increment
+ * the user_mapped counter without remapping pages.
+ */
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+ struct ftrace_buffer_info *info = vma->vm_file->private_data;
+ struct trace_iterator *iter = &info->iter;
+
+ ring_buffer_map_dup(iter->array_buffer->buffer, iter->cpu_file);
+}
+
static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
{
struct ftrace_buffer_info *info = vma->vm_file->private_data;
@@ -8232,6 +8244,7 @@ static int tracing_buffers_may_split(struct vm_area_struct *vma, unsigned long a
}

static const struct vm_operations_struct tracing_buffers_vmops = {
+ .open = tracing_buffers_mmap_open,
.close = tracing_buffers_mmap_close,
.may_split = tracing_buffers_may_split,
};
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 9928da636c9d..b7343fdfd7b0 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1039,6 +1039,7 @@ event_filter_pid_sched_process_exit(void *data, struct task_struct *task)
struct trace_pid_list *pid_list;
struct trace_array *tr = data;

+ guard(preempt)();
pid_list = rcu_dereference_raw(tr->filtered_pids);
trace_filter_add_remove_task(pid_list, NULL, task);

@@ -1054,6 +1055,7 @@ event_filter_pid_sched_process_fork(void *data,
struct trace_pid_list *pid_list;
struct trace_array *tr = data;

+ guard(preempt)();
pid_list = rcu_dereference_sched(tr->filtered_pids);
trace_filter_add_remove_task(pid_list, self, task);

@@ -4668,26 +4670,22 @@ static __init int event_trace_memsetup(void)
return 0;
}

-__init void
-early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
+/*
+ * Helper function to enable or disable a comma-separated list of events
+ * from the bootup buffer.
+ */
+static __init void __early_set_events(struct trace_array *tr, char *buf, bool enable)
{
char *token;
- int ret;
-
- while (true) {
- token = strsep(&buf, ",");
-
- if (!token)
- break;

+ while ((token = strsep(&buf, ","))) {
if (*token) {
- /* Restarting syscalls requires that we stop them first */
- if (disable_first)
+ if (enable) {
+ if (ftrace_set_clr_event(tr, token, 1))
+ pr_warn("Failed to enable trace event: %s\n", token);
+ } else {
ftrace_set_clr_event(tr, token, 0);
-
- ret = ftrace_set_clr_event(tr, token, 1);
- if (ret)
- pr_warn("Failed to enable trace event: %s\n", token);
+ }
}

/* Put back the comma to allow this to be called again */
@@ -4696,6 +4694,32 @@ early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
}
}

+/**
+ * early_enable_events - enable events from the bootup buffer
+ * @tr: The trace array to enable the events in
+ * @buf: The buffer containing the comma separated list of events
+ * @disable_first: If true, disable all events in @buf before enabling them
+ *
+ * This function enables events from the bootup buffer. If @disable_first
+ * is true, it will first disable all events in the buffer before enabling
+ * them.
+ *
+ * For syscall events, which rely on a global refcount to register the
+ * SYSCALL_WORK_SYSCALL_TRACEPOINT flag (especially for pid 1), we must
+ * ensure the refcount hits zero before re-enabling them. A simple
+ * "disable then enable" per-event is not enough if multiple syscalls are
+ * used, as the refcount will stay above zero. Thus, we need a two-phase
+ * approach: disable all, then enable all.
+ */
+__init void
+early_enable_events(struct trace_array *tr, char *buf, bool disable_first)
+{
+ if (disable_first)
+ __early_set_events(tr, buf, false);
+
+ __early_set_events(tr, buf, true);
+}
+
static __init int event_trace_enable(void)
{
struct trace_array *tr = top_trace_array();
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 3d8239fee004..0d2d3a2ea7dd 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -400,14 +400,19 @@ static void trace_graph_thresh_return(struct ftrace_graph_ret *trace,
struct fgraph_ops *gops,
struct ftrace_regs *fregs)
{
+ unsigned long *task_var = fgraph_get_task_var(gops);
struct fgraph_times *ftimes;
struct trace_array *tr;
+ unsigned int trace_ctx;
+ u64 calltime, rettime;
int size;

+ rettime = trace_clock_local();
+
ftrace_graph_addr_finish(gops, trace);

- if (trace_recursion_test(TRACE_GRAPH_NOTRACE_BIT)) {
- trace_recursion_clear(TRACE_GRAPH_NOTRACE_BIT);
+ if (*task_var & TRACE_GRAPH_NOTRACE) {
+ *task_var &= ~TRACE_GRAPH_NOTRACE;
return;
}

@@ -418,11 +423,13 @@ static void trace_graph_thresh_return(struct ftrace_graph_ret *trace,
tr = gops->private;
handle_nosleeptime(tr, trace, ftimes, size);

- if (tracing_thresh &&
- (trace_clock_local() - ftimes->calltime < tracing_thresh))
+ calltime = ftimes->calltime;
+
+ if (tracing_thresh && (rettime - calltime < tracing_thresh))
return;
- else
- trace_graph_return(trace, gops, fregs);
+
+ trace_ctx = tracing_gen_ctx();
+ __trace_graph_return(tr, trace, trace_ctx, calltime, rettime);
}

static struct fgraph_ops funcgraph_ops = {