Re: [PATCH v2] perf/x86: Move event pointer setup earlier in x86_pmu_enable()
From: Mi, Dapeng
Date: Wed Mar 11 2026 - 21:06:18 EST
On 3/12/2026 1:18 AM, Peter Zijlstra wrote:
> On Wed, Mar 11, 2026 at 10:04:10AM +0800, Mi, Dapeng wrote:
>> On 3/10/2026 6:13 PM, Breno Leitao wrote:
>>> A production AMD EPYC system crashed with a NULL pointer dereference
>>> in the PMU NMI handler:
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000198
>>> RIP: x86_perf_event_update+0xc/0xa0
>>> Call Trace:
>>> <NMI>
>>> amd_pmu_v2_handle_irq+0x1a6/0x390
>>> perf_event_nmi_handler+0x24/0x40
>>>
>>> The faulting instruction is `cmpq $0x0, 0x198(%rdi)` with RDI=0,
>>> corresponding to the `if (unlikely(!hwc->event_base))` check in
>>> x86_perf_event_update() where hwc = &event->hw and event is NULL.
>>>
>>> drgn inspection of the vmcore on CPU 106 showed a mismatch between
>>> cpuc->active_mask and cpuc->events[]:
>>>
>>> active_mask: 0x1e (bits 1, 2, 3, 4)
>>> events[1]: 0xff1100136cbd4f38 (valid)
>>> events[2]: 0x0 (NULL, but active_mask bit 2 set)
>>> events[3]: 0xff1100076fd2cf38 (valid)
>>> events[4]: 0xff1100079e990a90 (valid)
>>>
>>> The event that should occupy events[2] was found in event_list[2]
>>> with hw.idx=2 and hw.state=0x0, confirming x86_pmu_start() had run
>>> (which clears hw.state and sets active_mask) but events[2] was
>>> never populated.
>>>
>>> Another event (event_list[0]) had hw.state=0x7 (STOPPED|UPTODATE|ARCH),
>>> showing it was stopped when the PMU rescheduled events, confirming the
>>> throttle-then-reschedule sequence occurred.
>>>
>>> The root cause is commit 7e772a93eb61 ("perf/x86: Fix NULL event access
>>> and potential PEBS record loss") which moved the cpuc->events[idx]
>>> assignment out of x86_pmu_start() and into step 2 of x86_pmu_enable(),
>>> after the PERF_HES_ARCH check. This broke any path that calls
>>> pmu->start() without going through x86_pmu_enable() -- specifically
>>> the unthrottle path:
>>>
>>> perf_adjust_freq_unthr_events()
>>> -> perf_event_unthrottle_group()
>>> -> perf_event_unthrottle()
>>> -> event->pmu->start(event, 0)
>>> -> x86_pmu_start() // sets active_mask but not events[]
>>>
>>> The race sequence is:
>>>
>>> 1. A group of perf events overflows, triggering group throttle via
>>> perf_event_throttle_group(). All events are stopped: active_mask
>>> bits cleared, events[] preserved (x86_pmu_stop no longer clears
>>> events[] after commit 7e772a93eb61).
>>>
>>> 2. While still throttled (PERF_HES_STOPPED), x86_pmu_enable() runs
>>> due to other scheduling activity. Stopped events that need to
>>> move counters get PERF_HES_ARCH set and events[old_idx] cleared.
>>> In step 2 of x86_pmu_enable(), PERF_HES_ARCH causes these events
>>> to be skipped -- events[new_idx] is never set.
>>>
>>> 3. The timer tick unthrottles the group via pmu->start(). Since
>>> commit 7e772a93eb61 removed the events[] assignment from
>>> x86_pmu_start(), active_mask[new_idx] is set but events[new_idx]
>>> remains NULL.
>>>
>>> 4. A PMC overflow NMI fires. The handler iterates active counters,
>>> finds active_mask[2] set, reads events[2] which is NULL, and
>>> crashes dereferencing it.
>>>
>>> Move the cpuc->events[hwc->idx] assignment in x86_pmu_enable() to
>>> before the PERF_HES_ARCH check, so that events[] is populated even
>>> for events that are not immediately started. This ensures the
>>> unthrottle path via pmu->start() always finds a valid event pointer.
>>>
>>> Fixes: 7e772a93eb61 ("perf/x86: Fix NULL event access and potential PEBS record loss")
>>> Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx>
>>> Cc: stable@xxxxxxxxxxxxxxx
>>> ---
>>> Changes in v2:
>>> - Move event pointer setup earlier in x86_pmu_enable() (peterz)
>>> - Rewrote the patch title, given the new approach
>>> - Link to v1: https://patch.msgid.link/20260309-perf-v1-1-601ffb531893@xxxxxxxxxx
>>> ---
>>> arch/x86/events/core.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>>> index 03ce1bc7ef2ea..54b4c315d927f 100644
>>> --- a/arch/x86/events/core.c
>>> +++ b/arch/x86/events/core.c
>>> @@ -1372,6 +1372,8 @@ static void x86_pmu_enable(struct pmu *pmu)
>>> else if (i < n_running)
>>> continue;
>>>
>>> + cpuc->events[hwc->idx] = event;
>>> +
>>> if (hwc->state & PERF_HES_ARCH)
>>> continue;
>>>
>>> @@ -1379,7 +1381,6 @@ static void x86_pmu_enable(struct pmu *pmu)
>>> * if cpuc->enabled = 0, then no wrmsr as
>>> * per x86_pmu_enable_event()
>>> */
>>> - cpuc->events[hwc->idx] = event;
>>> x86_pmu_start(event, PERF_EF_RELOAD);
>>> }
>>> cpuc->n_added = 0;
>> On second thought, it seems this change could subtly break the current
>> PEBS counter snapshot logic.
>>
>> Currently intel_perf_event_update_pmc() needs to filter out these
>> uninitialized counters by checking whether the event is NULL, as the
>> comment and code below show.
>>
>> ```
>>
>> * - An event is stopped for some reason, e.g., throttled.
>> * During this period, another event is added and takes the
>> * counter of the stopped event. The stopped event is assigned
>> * to another new and uninitialized counter, since the
>> * x86_pmu_start(RELOAD) is not invoked for a stopped event.
>> * The PEBS_DATA_CFG is updated regardless of the event state.
>> * The uninitialized counter can be recorded in a PEBS record.
>> * But the cpuc->events[uninitialized_counter] is always NULL,
>> * because the event is stopped. The uninitialized value is
>> * safely dropped.
>> */
>> if (!event)
>> return;
>>
>> ```
>>
>> With this change, the original index of a stopped event could be
>> assigned to a new event. In that case, although the new event has not
>> been activated yet, cpuc->events[original_index] has already been
>> initialized and is no longer NULL, so intel_perf_event_update_pmc()
>> could apply the cached count value to the wrong event.
>>
>> I suppose we have two ways to fix this issue.
>>
>> 1. Move "cpuc->events[idx] = event" into x86_pmu_start(), just like what
>> the v1 patch does.
> That's not what v1 did; v1 did an additional setting.
Oh, yeah. "move" is not an accurate word.
>
>> 2. Check cpuc->active_mask in intel_perf_event_update_pmc() as well.
>> The side effect is that the cached counter snapshots for stopped events
>> have to be dropped, so there is no chance to update the count value for
>> these stopped events even when their HW indices are not occupied by
>> other new events.
>>
>> Peter, what's your opinion on this? Thanks.
> So you're saying that intel_perf_event_update_pmc() will be trying to
> read the hardware counter; which hasn't been written with a sensible
> value (and thus mis-behave) even though the event is STOPPED and the
> active_mask bit is unset?
Yes, currently intel_perf_event_update_pmc() only checks whether the
cpuc->events[x] pointer (assume the counter index is "x") is NULL before
updating the event count from the counter snapshot in a PEBS record.
But after commit 7e772a93eb61 ("perf/x86: Fix NULL event access and
potential PEBS record loss"), cpuc->events[x] is no longer cleared when
the event is STOPPED; it stays set until the event is actually deleted,
so intel_perf_event_update_pmc() can still update the event count for a
STOPPED event when draining buffered PEBS records.
Even worse, the original counter index "x" could be assigned to a new
event during scheduling, and the snapshotted count value could then be
applied to the new event incorrectly.
>
> I'm thinking intel_perf_event_update_pmc() needs help either way around
> :-)
Sure. I would add an active_mask check there.