Re: perf AUX: race causes poll() hang
From: Peter Zijlstra
Date: Mon May 18 2026 - 05:42:51 EST
On Fri, May 15, 2026 at 01:35:48AM +0300, Константин Михайлов wrote:
> Hello Peter and Adrian,
>
> I'd like to report a potential race condition in perf AUX buffer handling.
>
> AUX tracing is designed to allow the tracee to continue running when the AUX
> buffer fills. The PMU driver must disable tracing when the AUX buffer is full.
> Typically, it schedules IRQ work to disable the event later. Meanwhile, a
> typical tracer's workflow looks like: poll() on perf FDs, consume the data,
> re-enable the event via PERF_EVENT_IOC_ENABLE ioctl(), then poll() again.
>
> Given this, the following race is possible:
> -------------------
> | CPU #0 | CPU #1 |
> | tracee | tracer |
> -------------------
> |   **   |        | tracee fills the AUX buffer completely with some data
> -------------------
> |   **   |        | PMU driver updates aux_head accordingly and schedules IRQ works
> |   **   |        | to disable the event and wake up the tracer (setting rb->poll in
> |   **   |        | perf_output_wakeup() along the way)
> -------------------
> |        |   **   | tracer consumes all the data from AUX buffer,
> |        |   **   | thus clears rb->poll in perf_poll()
> -------------------
> |        |   **   | tracer re-enables the tracing (the event is still active,
> |        |   **   | so ioctl(...) returns immediately)
> -------------------
> |        |   **   | tracer starts poll()'ing the AUX buffer again
> -------------------
> |   **   |        | IRQ work handler finally disables the event and
> |   **   |        | wakes up tracer
> -------------------
> |        |   **   | tracer obtains zero rb->poll and continues polling
> -------------------
> As a result, the tracee runs without PMU tracing, and the tracer's poll() is
> never woken up unless it has a timeout.
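
For illustration, the interleaving in the table above can be replayed as a
deterministic toy model. This is only a sketch and not kernel code:
`struct race_model`, its fields, and every function below are hypothetical
names made up for the example; only the ordering of the steps is taken from
the report.

```c
#include <stdbool.h>

/* Hypothetical toy state; these are not the kernel's data structures. */
struct race_model {
	bool event_active;    /* PMU tracing is enabled */
	bool disable_queued;  /* IRQ work that will disable the event is queued */
	bool rb_poll;         /* stands in for rb->poll */
};

/* Tracee: the buffer is full; readiness is flagged, the disable is deferred. */
static void aux_buffer_full(struct race_model *m)
{
	m->rb_poll = true;
	m->disable_queued = true;
}

/* Deferred IRQ work: disables the event; it does not set rb_poll again. */
static void irq_work_run(struct race_model *m)
{
	if (m->disable_queued) {
		m->disable_queued = false;
		m->event_active = false;
	}
}

/* Tracer: poll() reads and clears the readiness flag in one step. */
static bool tracer_poll(struct race_model *m)
{
	bool ready = m->rb_poll;

	m->rb_poll = false;
	return ready;
}

/* Tracer: re-enable the event; a no-op while it is still active. */
static void tracer_enable(struct race_model *m)
{
	if (!m->event_active)
		m->event_active = true;
}

/*
 * Replays one ordering. Returns true if the run ends with tracing off and
 * no readiness flag left to wake the tracer.
 */
static bool replay_hangs(bool irq_work_runs_late)
{
	struct race_model m = { .event_active = true };

	aux_buffer_full(&m);
	if (!irq_work_runs_late)
		irq_work_run(&m);
	(void)tracer_poll(&m);   /* first poll() returns, data is consumed */
	tracer_enable(&m);
	if (irq_work_runs_late)
		irq_work_run(&m);
	/* The tracer's next check of the flag, after the wakeup. */
	return !tracer_poll(&m) && !m.event_active;
}
```

In this model, `replay_hangs(true)` corresponds to the reported ordering and
returns true, while `replay_hangs(false)` (the IRQ work runs before the
ioctl()) returns false because the re-enable takes effect.
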
>
> I reproduced this on an x86 machine with intel_pt and kernel v6.17.
> Reproducing this race on the vanilla kernel is timing-sensitive, so I added
> a 30 ms delay in the error path in intel_pt_interrupt() when
> pt_buffer_reset_markers() returns an error - this delay widens the window
> between the aux_head update and the actual event disable in the IRQ work
> handler. I'm not sure that an error from pt_buffer_reset_markers() means
> that the buffer overflowed, but this error branch is taken sometimes and
> all needed IRQ works are scheduled during a call to perf_aux_output_end().
> I also added a 3 ms delay in perf in __auxtrace_mmap__read() before
> itr->read_finish(), ensuring the ioctl() falls into that window. With these
> changes, some perf runs collected smaller traces than usual. I added trace
> prints to the intel_pt driver and enabled tracing for sys_poll and
> sys_ioctl, which confirmed the exact sequence described above. The problem
> was sometimes mitigated by tracee migration to another CPU (as perf creates
> an event for every CPU, the event is re-enabled by the kernel when it is
> scheduled on a new CPU). Otherwise, the tracee stayed on the same CPU and
> the tracer hung in poll() until the tracee exited.
>
> Could you please confirm whether this analysis is correct? Should we move
> the setting of rb->poll to *after* the event is disabled in the IRQ work
> handler?
I *think* (it's been a minute since I looked at this code) that you're
right.

Does something like the below cure things?
---
 kernel/events/core.c | 54 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7935d5663944..490407618f36 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2677,6 +2677,9 @@ static void __perf_event_disable(struct perf_event *event,
 				 struct perf_event_context *ctx,
 				 void *info)
 {
+	if (event->pending_disable)
+		event->pending_disable = 0;
+
 	if (event->state < PERF_EVENT_STATE_INACTIVE)
 		return;

@@ -3278,32 +3281,37 @@ static void _perf_event_enable(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;

-	raw_spin_lock_irq(&ctx->lock);
-	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
-	    event->state < PERF_EVENT_STATE_ERROR) {
-out:
-		raw_spin_unlock_irq(&ctx->lock);
-		return;
-	}
+	scoped_guard (raw_spinlock_irq, &ctx->lock) {
+		if (event->state < PERF_EVENT_STATE_ERROR)
+			return;

-	/*
-	 * If the event is in error state, clear that first.
-	 *
-	 * That way, if we see the event in error state below, we know that it
-	 * has gone back into error state, as distinct from the task having
-	 * been scheduled away before the cross-call arrived.
-	 */
-	if (event->state == PERF_EVENT_STATE_ERROR) {
 		/*
-		 * Detached SIBLING events cannot leave ERROR state.
+		 * If the event is in error state, clear that first.
+		 *
+		 * That way, if we see the event in error state below, we know that it
+		 * has gone back into error state, as distinct from the task having
+		 * been scheduled away before the cross-call arrived.
 		 */
-		if (event->event_caps & PERF_EV_CAP_SIBLING &&
-		    event->group_leader == event)
-			goto out;
+		if (event->state == PERF_EVENT_STATE_ERROR) {
+			/*
+			 * Detached SIBLING events cannot leave ERROR state.
+			 */
+			if (event->event_caps & PERF_EV_CAP_SIBLING &&
+			    event->group_leader == event)
+				return;

-		event->state = PERF_EVENT_STATE_OFF;
+			event->state = PERF_EVENT_STATE_OFF;
+		}
+
+		if (event->pending_disable)
+			event->pending_disable = 0;
+
+		/*
+		 * Already running, nothing to do.
+		 */
+		if (event->state >= PERF_EVENT_STATE_INACTIVE)
+			return;
 	}
-	raw_spin_unlock_irq(&ctx->lock);

 	event_function_call(event, __perf_event_enable, NULL);
 }
@@ -7612,10 +7620,8 @@ static void __perf_pending_disable(struct perf_event *event)
 	 * Yay, we hit home and are in the context of the event.
 	 */
 	if (cpu == smp_processor_id()) {
-		if (event->pending_disable) {
-			event->pending_disable = 0;
+		if (event->pending_disable)
 			perf_event_disable_local(event);
-		}
 		return;
 	}

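
As a reading aid, the intent of the change can be modeled in a few lines of
plain C. This is a sketch under one reading of the diff, not kernel code:
`struct ev_model` and the `model_*` helpers are hypothetical, and the
`patched` flag only selects whether the enable path clears `pending_disable`,
which is the behavior the diff appears to add.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two pieces of event state that matter here. */
struct ev_model {
	bool active;
	bool pending_disable;
};

/* Buffer-full path: the disable is only requested, not performed yet. */
static void model_request_disable(struct ev_model *e)
{
	e->pending_disable = true;
}

/*
 * Enable path. With the patch (as read here), an enable cancels a disable
 * that has been requested but has not run yet.
 */
static void model_enable(struct ev_model *e, bool patched)
{
	if (patched)
		e->pending_disable = false;
	if (e->active)
		return;          /* already running, nothing to do */
	e->active = true;
}

/* Deferred work: disables the event only if the request is still pending. */
static void model_pending_disable(struct ev_model *e)
{
	if (e->pending_disable) {
		e->pending_disable = false;
		e->active = false;
	}
}

/* Reported ordering: request, enable, then the late deferred work. */
static bool active_after_late_irq_work(bool patched)
{
	struct ev_model e = { .active = true, .pending_disable = false };

	model_request_disable(&e);
	model_enable(&e, patched);
	model_pending_disable(&e);
	return e.active;
}
```

Without the cancellation the model ends with the event inactive after the
tracer's enable; with it, the late deferred work finds nothing pending and
the event stays active.
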