Re: [PATCH v3] perf/core: Fix missing read event generation on task exit

From: Peter Zijlstra

Date: Fri Feb 06 2026 - 10:29:46 EST


On Fri, Feb 06, 2026 at 11:21:19AM +0000, James Clark wrote:

> I've been looking into a regression caused by this commit and didn't manage
> to come up with a fix. But shouldn't this be something more like:
>
> if (attach_state & PERF_ATTACH_CHILD && event_filter_match(event))
> 	sync_child_event(event, task);
>
> As in, you only want to call sync_child_event() and write stuff to the ring
> buffer from the CPU that is currently running this exit handler? This
> change affects the 'total_time_enabled' tracking as well, though I'm not
> 100% sure we're not double counting it anyway.
>
> From perf_event_exit_task_context(), perf_event_exit_event() is called on
> all events, which includes events on other CPUs:
>
> list_for_each_entry_safe(child_event, next, &ctx->event_list, ...)
> 	perf_event_exit_event(child_event, ctx, exit ? task : NULL, false);
>
> Then we write into those other CPUs' ring buffers, which don't support
> concurrent writers.
>
> The reason I found this is because we have a tracing test that spawns some
> threads and then looks for PERF_RECORD_AUX events. When there are concurrent
> writes into the ring buffers, rb->nest tracking gets messed up leaving the
> count positive even after all nested writers have finished. Then all future
> writes don't copy the data_head pointer to the user page (because it thinks
> someone else is writing), so perf doesn't copy out any data anymore,
> leaving records missing.
>
> An easy reproducer is to add a warning that checks that the ring buffer
> being written to belongs to the current CPU:
>
> @@ -41,10 +41,11 @@ static void perf_output_get_handle(struct perf_output_handle *handle)
>  {
>  	struct perf_buffer *rb = handle->rb;
>
>  	preempt_disable();
> +	WARN_ON(handle->event->cpu != smp_processor_id());
>
>
> And then record:
>
> perf record -s -- stress -c 8 -t 1
>
> Which results in:
>
> perf_output_begin+0x320/0x480 (P)
> perf_event_exit_event+0x178/0x2c0
> perf_event_exit_task_context+0x214/0x2f0
> perf_event_exit_task+0xb0/0x3b0
> do_exit+0x1bc/0x808
> __arm64_sys_exit+0x28/0x30
> invoke_syscall+0x4c/0xe8
> el0_svc_common+0x9c/0xf0
> do_el0_svc+0x28/0x40
> el0_svc+0x50/0x240
> el0t_64_sync_handler+0x78/0x130
> el0t_64_sync+0x198/0x1a0
>
> I suppose there is a chance that this is only an issue when also doing
> perf_aux_output_begin()/perf_aux_output_end() from start/stop, because
> that's where I saw the real race. Maybe without that, accessing the rb
> from another CPU is OK because there is some locking, but I think this
> might be a more general issue.

I *think* something like so.

Before the patch in question this would never happen, because we called
things too late and always hit that TASK_TOMBSTONE.

But irrespective of emitting that event, we do want to propagate the
count and runtime numbers.


---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5b5cb620499e..f566ad55b4fb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -14086,7 +14086,7 @@ static void sync_child_event(struct perf_event *child_event,
 	u64 child_val;
 
 	if (child_event->attr.inherit_stat) {
-		if (task && task != TASK_TOMBSTONE)
+		if (task && task != TASK_TOMBSTONE && event_filter_match(child_event))
 			perf_event_read_event(child_event, task);