Re: [PATCH v3] perf/core: Fix missing read event generation on task exit
From: James Clark
Date: Mon Feb 09 2026 - 11:12:46 EST
On 06/02/2026 3:29 pm, Peter Zijlstra wrote:
On Fri, Feb 06, 2026 at 11:21:19AM +0000, James Clark wrote:
I've been looking into a regression caused by this commit and didn't manage
to come up with a fix. But shouldn't this be something more like:
if (attach_state & PERF_ATTACH_CHILD && event_filter_match(event))
sync_child_event(event, task);
As in, you only want to call sync_child_event() and write into the ring
buffer for the CPU that is currently running this exit handler? This change
affects the 'total_time_enabled' tracking as well, though I'm not 100% sure
we aren't double counting it anyway.
From perf_event_exit_task_context(), perf_event_exit_event() is called on
all events, which includes events on other CPUs:
list_for_each_entry_safe(child_event, next, &ctx->event_list, ...)
perf_event_exit_event(child_event, ctx, exit ? task : NULL, false);
Then we write into those other CPUs' ring buffers, which don't support
concurrent writers.
The reason I found this is because we have a tracing test that spawns some
threads and then looks for PERF_RECORD_AUX events. When there are concurrent
writes into the ring buffers, rb->nest tracking gets messed up leaving the
count positive even after all nested writers have finished. Then all future
writes don't copy the data_head pointer to the user page (because it thinks
someone else is writing), so perf doesn't copy out any data anymore,
leaving records missing.
An easy reproducer is to add a warning checking that the ring buffer being
written to is the correct one:
@@ -41,10 +41,11 @@ static void perf_output_get_handle(struct perf_output_handle *handle)
 {
 	struct perf_buffer *rb = handle->rb;

 	preempt_disable();
+	WARN_ON(handle->event->cpu != smp_processor_id());
And then record:
perf record -s -- stress -c 8 -t 1
Which results in:
perf_output_begin+0x320/0x480 (P)
perf_event_exit_event+0x178/0x2c0
perf_event_exit_task_context+0x214/0x2f0
perf_event_exit_task+0xb0/0x3b0
do_exit+0x1bc/0x808
__arm64_sys_exit+0x28/0x30
invoke_syscall+0x4c/0xe8
el0_svc_common+0x9c/0xf0
do_el0_svc+0x28/0x40
el0_svc+0x50/0x240
el0t_64_sync_handler+0x78/0x130
el0t_64_sync+0x198/0x1a0
I suppose there is a chance that this is only an issue when also doing
perf_aux_output_begin()/perf_aux_output_end() from start/stop because that's
where I saw the real race? Maybe without that, accessing the rb from another
CPU is ok because there is some locking, but I think this might be a more
general issue.
I *think* something like so.
Before the patch in question this would never happen, because of calling
things too late and always hitting that TASK_TOMBSTONE.
But irrespective of emitting that event, we do want to propagate the
count and runtime numbers.
---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5b5cb620499e..f566ad55b4fb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -14086,7 +14086,7 @@ static void sync_child_event(struct perf_event *child_event,
 	u64 child_val;

 	if (child_event->attr.inherit_stat) {
-		if (task && task != TASK_TOMBSTONE)
+		if (task && task != TASK_TOMBSTONE && event_filter_match(child_event))
			perf_event_read_event(child_event, task);
 	}
It turns out I tested this before with "child_event->cpu == raw_smp_processor_id()" rather than event_filter_match(), so I missed that the loop over all the events needs to be wrapped in preempt_disable(). But that can't be done, because perf_event_exit_event() takes a mutex.
I don't think the preempt_disable() can cover any smaller region than the entire loop; otherwise you can get rescheduled between event_filter_match() checks, end up failing all of them, and not write any event out at all.
While debugging I also noticed another issue with these per-thread count records: perf_event_exit_event() only does anything if the event has a parent. But the context switch optimization means that threads sometimes re-use the original event, which has no parent, so you randomly get threads that are missing from the output.
There is a comment that mentions this under the parent check:
if (parent_event) {
/*
* Do not destroy the 'original' grouping; because of the
* context switch optimization the original events could've
* ended up in a random child task.
I'm not sure whether that was supposed to be worked around some other way and is now broken, or whether it was a known limitation of the implementation from the beginning. Right now it either randomly misses one of the threads while including the main thread counts, or includes all the threads but not the main thread counts when no context switch optimization was done.
Nothing in the perf record docs suggests you shouldn't expect all threads to be there:
-s, --stat per thread counts