Re: perf AUX: race causes poll() hang
From: Adrian Hunter
Date: Tue May 19 2026 - 01:59:07 EST

On 18/05/2026 12:41, Peter Zijlstra wrote:
> On Fri, May 15, 2026 at 01:35:48AM +0300, Константин Михайлов wrote:
>> Hello Peter and Adrian,
>>
>> I'd like to report a potential race condition in perf AUX buffer handling.
>>
>> AUX tracing is designed to allow the tracee to continue running when the AUX
>> buffer fills. The PMU driver must disable tracing when the AUX buffer is full.
>> Typically, it schedules IRQ work to disable the event later. Meanwhile, a
>> typical tracer's workflow looks like: poll() on perf FDs, consume the data,
>> re-enable the event via PERF_EVENT_IOC_ENABLE ioctl(), then poll() again.
>>
>> Given this, the following race is possible:
>> -------------------
>> | CPU #0 | CPU #1 |
>> | tracee | tracer |
>> -------------------
>> |   **   |        | tracee fills the AUX buffer completely with some data
>> -------------------
>> |   **   |        | PMU driver updates aux_head accordingly and schedules IRQ works
>> |   **   |        | to disable the event and wake up the tracer (setting rb->poll in
>> |   **   |        | perf_output_wakeup() along the way)
>> -------------------
>> |        |   **   | tracer consumes all the data from AUX buffer,
>> |        |   **   | thus clears rb->poll in perf_poll()
>> -------------------
>> |        |   **   | tracer re-enables the tracing (the event is still active,
>> |        |   **   | so ioctl(...) returns immediately)
>> -------------------
>> |        |   **   | tracer starts poll()'ing the AUX buffer again
>> -------------------
>> |   **   |        | IRQ work handler finally disables the event and
>> |   **   |        | wakes up tracer
>> -------------------
>> |        |   **   | tracer obtains zero rb->poll and continues polling
>> -------------------
>> As a result, the tracee runs without PMU tracing, and the tracer's poll()
>> will never be woken up unless it has a timeout.
>>
>> I reproduced this on an x86 machine with intel_pt and kernel v6.17.
>> Reproducing this race on the vanilla kernel is timing-sensitive, so I added
>> a 30 ms delay in the error path in intel_pt_interrupt() when
>> pt_buffer_reset_markers() returns an error - this delay widens the window
>> between the aux_head update and the actual event disable in the IRQ work
>> handler. I'm not sure that an error from pt_buffer_reset_markers() means
>> that the buffer overflowed, but this error branch is taken sometimes and all
>> needed IRQ works are scheduled during a call to perf_aux_output_end(). I
>> also added a 3 ms delay in perf in __auxtrace_mmap__read() before
>> itr->read_finish(), ensuring the ioctl() falls into that window. With these
>> changes, some perf runs collected smaller traces than usual. I added trace
>> prints to the intel_pt driver and enabled tracing for sys_poll and
>> sys_ioctl, which confirmed the exact sequence described above. The problem
>> was sometimes mitigated by tracee migration to another CPU (as perf creates
>> an event for every CPU, the event is re-enabled by the kernel when it is
>> set on a new CPU). Otherwise, the tracee stayed on the same CPU and the
>> tracer hung on poll() until the tracee exited.
>>
>> Could you please confirm if this analysis is correct? Should we move
>> setting of rb->poll *after* the event is disabled in IRQ work handler?
>
> I *think* (it's been a minute since I looked at this code), that you're
> right.
>
> Does something like the below cure things?
>
> ---
> kernel/events/core.c | 54 +++++++++++++++++++++++++++++-----------------------
> 1 file changed, 30 insertions(+), 24 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7935d5663944..490407618f36 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2677,6 +2677,9 @@ static void __perf_event_disable(struct perf_event *event,
>  				 struct perf_event_context *ctx,
>  				 void *info)
>  {
> +	if (event->pending_disable)
> +		event->pending_disable = 0;
> +
>  	if (event->state < PERF_EVENT_STATE_INACTIVE)
>  		return;
>
> @@ -3278,32 +3281,37 @@ static void _perf_event_enable(struct perf_event *event)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> 
> -	raw_spin_lock_irq(&ctx->lock);
> -	if (event->state >= PERF_EVENT_STATE_INACTIVE ||
> -	    event->state < PERF_EVENT_STATE_ERROR) {
> -out:
> -		raw_spin_unlock_irq(&ctx->lock);
> -		return;
> -	}
> +	scoped_guard (raw_spinlock_irq, &ctx->lock) {
> +		if (event->state < PERF_EVENT_STATE_ERROR)
> +			return;
> 
> -	/*
> -	 * If the event is in error state, clear that first.
> -	 *
> -	 * That way, if we see the event in error state below, we know that it
> -	 * has gone back into error state, as distinct from the task having
> -	 * been scheduled away before the cross-call arrived.
> -	 */
> -	if (event->state == PERF_EVENT_STATE_ERROR) {
>  		/*
> -		 * Detached SIBLING events cannot leave ERROR state.
> +		 * If the event is in error state, clear that first.
> +		 *
> +		 * That way, if we see the event in error state below, we know that it
> +		 * has gone back into error state, as distinct from the task having
> +		 * been scheduled away before the cross-call arrived.
>  		 */
> -		if (event->event_caps & PERF_EV_CAP_SIBLING &&
> -		    event->group_leader == event)
> -			goto out;
> +		if (event->state == PERF_EVENT_STATE_ERROR) {
> +			/*
> +			 * Detached SIBLING events cannot leave ERROR state.
> +			 */
> +			if (event->event_caps & PERF_EV_CAP_SIBLING &&
> +			    event->group_leader == event)
> +				return;
> 
> -		event->state = PERF_EVENT_STATE_OFF;
> +			event->state = PERF_EVENT_STATE_OFF;
> +		}
> +
> +		if (event->pending_disable)
> +			event->pending_disable = 0;
> +
> +		/*
> +		 * Already running, nothing to do.
> +		 */
> +		if (event->state >= PERF_EVENT_STATE_INACTIVE)
> +			return;
>  	}
> -	raw_spin_unlock_irq(&ctx->lock);
> 
>  	event_function_call(event, __perf_event_enable, NULL);
>  }
> @@ -7612,10 +7620,8 @@ static void __perf_pending_disable(struct perf_event *event)
>  	 * Yay, we hit home and are in the context of the event.
>  	 */
>  	if (cpu == smp_processor_id()) {
> -		if (event->pending_disable) {
> -			event->pending_disable = 0;
> +		if (event->pending_disable)

At this point, can _perf_event_enable() on another CPU decide
to do nothing ("Already running, nothing to do"), but then the
disable below contradicts that?

>  			perf_event_disable_local(event);
> -		}
>  		return;
>  	}
>
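
The interleavings discussed above can be sketched as a toy userspace model.
This is not kernel code: struct model and every function name below are
hypothetical stand-ins, the steps run sequentially in one thread, and no
locking is modelled. The model assumes, as in the report, that the late IRQ
work wakeup does not set rb->poll again. Whether the third interleaving can
actually occur in the kernel is exactly the open question; the sketch only
shows what would follow if it can.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few fields involved; not the kernel structs. */
struct model {
    bool active;          /* event is enabled (tracing is on) */
    bool pending_disable; /* a disable has been queued as IRQ work */
    bool rb_poll;         /* models rb->poll: data is ready for poll() */
};

/* PMU driver finds the AUX buffer full: queue the disable, flag the wakeup. */
static void pmu_buffer_full(struct model *m)
{
    m->pending_disable = true;
    m->rb_poll = true;
}

/* Tracer's poll(): reports readiness and clears the flag, as in the report. */
static bool tracer_poll(struct model *m)
{
    bool ready = m->rb_poll;
    m->rb_poll = false;
    return ready;
}

/* Enable without the patch: an active event means nothing to do. */
static void enable_old(struct model *m)
{
    if (m->active)
        return;
    m->active = true;
}

/* Enable with the patch: cancel a queued disable before the early return. */
static void enable_patched(struct model *m)
{
    m->pending_disable = false;
    if (m->active)
        return;
    m->active = true;
}

/* The disable itself; it also clears pending_disable. */
static void disable_local(struct model *m)
{
    m->pending_disable = false;
    m->active = false;
}

/* Tracer hangs when tracing is off and nothing will make poll() return. */
static bool tracer_stuck(const struct model *m)
{
    return !m->active && !m->rb_poll;
}

/* Reported race: enable is a no-op, then the late IRQ work disables. */
bool scenario_original(void)
{
    struct model m = { .active = true };
    pmu_buffer_full(&m);
    bool woke = tracer_poll(&m);
    assert(woke);                 /* first poll() does see the data */
    enable_old(&m);
    if (m.pending_disable)        /* late IRQ work */
        disable_local(&m);
    return tracer_stuck(&m);
}

/* Patched enable runs entirely before the IRQ work checks the flag. */
bool scenario_patched(void)
{
    struct model m = { .active = true };
    pmu_buffer_full(&m);
    tracer_poll(&m);
    enable_patched(&m);
    if (m.pending_disable)        /* IRQ work finds nothing pending */
        disable_local(&m);
    return tracer_stuck(&m);
}

/* Interleaving in question: enable lands between the check and the disable. */
bool scenario_patched_split(void)
{
    struct model m = { .active = true };
    pmu_buffer_full(&m);
    tracer_poll(&m);
    bool saw_pending = m.pending_disable; /* IRQ work: check only */
    enable_patched(&m);                   /* "Already running, nothing to do" */
    if (saw_pending)
        disable_local(&m);                /* IRQ work: disable anyway */
    return tracer_stuck(&m);
}
```

In this model scenario_original() and scenario_patched_split() both end with
the event disabled and rb_poll clear, while scenario_patched() leaves the
event enabled.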