Re: perf AUX: race causes poll() hang
From: Konstantin Mikhaylov
Date: Sun Jun 14 2026 - 18:43:37 EST
On 5/18/26 12:41 PM, Peter Zijlstra wrote:
On Fri, May 15, 2026 at 01:35:48AM +0300, Константин Михайлов wrote:
Hello Peter and Adrian,
I'd like to report a potential race condition in perf AUX buffer handling.
AUX tracing is designed to allow the tracee to continue running when the AUX
buffer fills. The PMU driver must disable tracing when the AUX buffer is full.
Typically, it schedules IRQ work to disable the event later. Meanwhile, a
typical tracer's workflow looks like: poll() on perf FDs, consume the data,
re-enable the event via PERF_EVENT_IOC_ENABLE ioctl(), then poll() again.
Given this, the following race is possible:
-------------------
| CPU #0 | CPU #1 |
| tracee | tracer |
-------------------
| ** | | tracee fills the AUX buffer completely with some data
-------------------
| ** | | PMU driver updates aux_head accordingly and schedules IRQ works
| ** | | to disable the event and wake up the tracer (setting rb->poll in
| ** | | perf_output_wakeup() along the way)
-------------------
| | ** | tracer consumes all the data from AUX buffer,
| | ** | thus clears rb->poll in perf_poll()
-------------------
| | ** | tracer re-enables the tracing (the event is still active,
| | ** | so ioctl(...) returns immediately)
-------------------
| | ** | tracer starts poll()'ing the AUX buffer again
-------------------
| ** | | IRQ work handler finally disables the event and
| ** | | wakes up tracer
-------------------
| | ** | tracer obtains zero rb->poll and continues polling
-------------------
As a result, tracee runs without PMU tracing, and tracer's poll() will
never be woken up unless it has some timeout.
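The interleaving above can be replayed with a small userspace toy model. This is only a sketch and not kernel code: struct race_model and every model_*() function are hypothetical names invented for illustration, and each field merely stands in for the kernel state named in its comment. It assumes, as described in the table, that poll() consumes rb->poll and that the enable ioctl does nothing while the event is still active.

```c
#include <stdbool.h>

/*
 * Toy model of the reported race. NOT kernel code: all names here are
 * hypothetical and only mirror the state discussed in the report.
 */
struct race_model {
	bool event_active;    /* event is still enabled from the kernel's view */
	bool disable_pending; /* IRQ work that will disable the event is queued */
	bool wakeup_pending;  /* IRQ work that will wake the tracer is queued */
	bool rb_poll;         /* stands in for rb->poll */
	bool tracer_sleeping; /* tracer is blocked in poll() */
};

/* Tracee side: AUX buffer is full, the driver queues the deferred work. */
void model_buffer_full(struct race_model *m)
{
	m->rb_poll = true;
	m->disable_pending = true;
	m->wakeup_pending = true;
}

/* Tracer side: poll() consumes rb_poll; returns true if data was signalled. */
bool model_poll(struct race_model *m)
{
	bool ready = m->rb_poll;

	m->rb_poll = false;
	m->tracer_sleeping = !ready;
	return ready;
}

/* Tracer side: the enable ioctl is a no-op while the event is active. */
void model_ioctl_enable(struct race_model *m)
{
	if (!m->event_active)
		m->event_active = true;
}

/* Tracee side: the deferred IRQ work finally runs. */
void model_irq_work(struct race_model *m)
{
	if (m->disable_pending) {
		m->disable_pending = false;
		m->event_active = false;
	}
	if (m->wakeup_pending) {
		m->wakeup_pending = false;
		/* A woken poller re-checks rb_poll and sleeps again if it is clear. */
		if (m->tracer_sleeping)
			model_poll(m);
	}
}

/* Replays the racy order from the table; returns true if the tracer hangs. */
bool model_replay_race(void)
{
	struct race_model m = { .event_active = true };

	model_buffer_full(&m);  /* tracee fills the buffer */
	model_poll(&m);         /* tracer sees data, rb_poll is cleared */
	model_ioctl_enable(&m); /* event still active: nothing happens */
	model_poll(&m);         /* tracer goes back to sleep */
	model_irq_work(&m);     /* event disabled, wakeup finds rb_poll == 0 */

	return m.tracer_sleeping && !m.event_active;
}

/* Same steps, but the IRQ work runs before the tracer gets to its ioctl. */
bool model_replay_ordered(void)
{
	struct race_model m = { .event_active = true };

	model_buffer_full(&m);
	model_irq_work(&m);     /* event is disabled before the tracer acts */
	model_poll(&m);         /* tracer sees data */
	model_ioctl_enable(&m); /* this time the ioctl really re-enables */
	model_poll(&m);         /* tracer sleeps, but tracing is running */

	return m.tracer_sleeping && !m.event_active;
}
```

In this model, model_replay_race() ends with the tracer asleep and the event disabled, which is the hang described above, while model_replay_ordered() ends with the tracer asleep but the event enabled, so a later buffer-full event can wake it.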
I reproduced this on an x86 machine with intel_pt and kernel v6.17.
Reproducing this race on the vanilla kernel is timing-sensitive, so I added
a 30 ms delay in the error path in intel_pt_interrupt() when
pt_buffer_reset_markers() returns an error - this delay widens the window
between aux_head update and actual event disable in IRQ work handler. I'm
not sure that pt_buffer_reset_markers()'s error means that the buffer
overflowed, but this error branch is taken sometimes and all needed IRQ
works are scheduled during a call to perf_aux_output_end(). I also added a
3 ms delay in perf in __auxtrace_mmap__read() before itr->read_finish(),
ensuring the ioctl() falls into that window. With these changes, some perf
runs collected smaller traces than usual. I added trace prints to the
intel_pt driver and enabled tracing for sys_poll and sys_ioctl, which
confirmed the exact sequence described above. The problem was sometimes
mitigated by tracee migration to another CPU (as perf creates an event for
every CPU, the event is re-enabled by the kernel when it is set on a new
CPU). Otherwise, the tracee stayed on the same CPU and the tracer hung on
poll() until the tracee exited.
Could you please confirm whether this analysis is correct? Should we move
the setting of rb->poll *after* the event is disabled in the IRQ work handler?
I *think* (it's been a minute since I looked at this code), that you're
right.
Does something like the below cure things?
---
kernel/events/core.c | 54 +++++++++++++++++++++++++++++-----------------------
1 file changed, 30 insertions(+), 24 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7935d5663944..490407618f36 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2677,6 +2677,9 @@ static void __perf_event_disable(struct perf_event *event,
struct perf_event_context *ctx,
void *info)
{
+ if (event->pending_disable)
+ event->pending_disable = 0;
+
if (event->state < PERF_EVENT_STATE_INACTIVE)
return;
@@ -3278,32 +3281,37 @@ static void _perf_event_enable(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;
- raw_spin_lock_irq(&ctx->lock);
- if (event->state >= PERF_EVENT_STATE_INACTIVE ||
- event->state < PERF_EVENT_STATE_ERROR) {
-out:
- raw_spin_unlock_irq(&ctx->lock);
- return;
- }
+ scoped_guard (raw_spinlock_irq, &ctx->lock) {
+ if (event->state < PERF_EVENT_STATE_ERROR)
+ return;
- /*
- * If the event is in error state, clear that first.
- *
- * That way, if we see the event in error state below, we know that it
- * has gone back into error state, as distinct from the task having
- * been scheduled away before the cross-call arrived.
- */
- if (event->state == PERF_EVENT_STATE_ERROR) {
/*
- * Detached SIBLING events cannot leave ERROR state.
+ * If the event is in error state, clear that first.
+ *
+ * That way, if we see the event in error state below, we know that it
+ * has gone back into error state, as distinct from the task having
+ * been scheduled away before the cross-call arrived.
*/
- if (event->event_caps & PERF_EV_CAP_SIBLING &&
- event->group_leader == event)
- goto out;
+ if (event->state == PERF_EVENT_STATE_ERROR) {
+ /*
+ * Detached SIBLING events cannot leave ERROR state.
+ */
+ if (event->event_caps & PERF_EV_CAP_SIBLING &&
+ event->group_leader == event)
+ return;
- event->state = PERF_EVENT_STATE_OFF;
+ event->state = PERF_EVENT_STATE_OFF;
+ }
+
+ if (event->pending_disable)
+ event->pending_disable = 0;
+
+ /*
+ * Already running, nothing to do.
+ */
+ if (event->state >= PERF_EVENT_STATE_INACTIVE)
+ return;
}
- raw_spin_unlock_irq(&ctx->lock);
event_function_call(event, __perf_event_enable, NULL);
}
@@ -7612,10 +7620,8 @@ static void __perf_pending_disable(struct perf_event *event)
* Yay, we hit home and are in the context of the event.
*/
if (cpu == smp_processor_id()) {
- if (event->pending_disable) {
- event->pending_disable = 0;
+ if (event->pending_disable)
perf_event_disable_local(event);
- }
return;
}
Sorry for the re-send: my previous reply (sent more than three weeks ago) was not formatted as plain text and was not accepted by lkml.org, so I am sending it again in plain text.
I tested this patch with the same delay on the error path in the intel_pt driver. With the patch applied, hardware tracing sometimes remains stopped until the next reschedule. The tracer does not re-enable the event when event->state >= PERF_EVENT_STATE_INACTIVE, and this condition is true while the driver is disabling hardware tracing. As a result, hardware tracing can stay disabled until the next reschedule even after the tracer has drained all data from the AUX buffer.