[PATCH AUTOSEL 6.19-6.18] perf/core: Fix slow perf_event_task_exit() with LBR callstacks

From: Sasha Levin

Date: Wed Feb 11 2026 - 20:14:24 EST


From: Namhyung Kim <namhyung@xxxxxxxxxx>

[ Upstream commit 4960626f956d63dce57f099016c2ecbe637a8229 ]

I got a report that a task is stuck in perf_event_exit_task() waiting
for global_ctx_data_rwsem. On large systems with lots threads, it'd
have performance issues when it grabs the lock to iterate all threads
in the system to allocate the context data.

And it'd block task exit path which is problematic especially under
memory pressure.

  perf_event_open
    perf_event_alloc
      attach_perf_ctx_data
        attach_global_ctx_data
          percpu_down_write (global_ctx_data_rwsem)
            for_each_process_thread
              alloc_task_ctx_data
                                               do_exit
                                                 perf_event_exit_task
                                                   percpu_down_read (global_ctx_data_rwsem)

It should not hold the global_ctx_data_rwsem on the exit path. Let's
skip allocation for exiting tasks and free the data carefully.

Reported-by: Rosalie Fang <rosaliefang@xxxxxxxxxx>
Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Signed-off-by: Namhyung Kim <namhyung@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Link: https://patch.msgid.link/20260112165157.1919624-1-namhyung@xxxxxxxxxx
Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
---

LLM Generated explanations, may be completely bogus:

---

## Comprehensive Analysis

### 1. COMMIT MESSAGE ANALYSIS

**Subject**: "perf/core: Fix slow perf_event_task_exit() with LBR
callstacks"

The commit message clearly describes a **performance regression / system
hang** scenario. Key indicators:
- **"I got a report that a task is stuck"**: a real-world report from a
Google engineer (Rosalie Fang)
- **"On large systems with lots threads, it'd have performance issues"**:
reproducible on production systems
- **"it'd block task exit path which is problematic especially under
memory pressure"**: can escalate to system unresponsiveness
- **Reported-by: Rosalie Fang <rosaliefang@xxxxxxxxxx>**: actual user
report
- **Suggested-by: Peter Zijlstra** and **Signed-off-by: Peter Zijlstra**:
the perf subsystem maintainer suggested and approved the fix

The commit message illustrates the lock-contention scenario:
1. `perf_event_open` -> `attach_global_ctx_data` takes
`global_ctx_data_rwsem` as a **writer** and iterates all threads to
allocate context data
2. Simultaneously, a task calling `do_exit` -> `perf_event_exit_task`
tries to take `global_ctx_data_rwsem` as a **reader**
3. On large systems with many threads, the write lock is held for a long
time during the `for_each_process_thread` loop, blocking ALL task
exits

This is not a true deadlock or livelock: exits resume once the writer
drops the lock. The problem is that task exit (a critical path) stalls
behind a potentially very long operation (iterating and allocating for
all threads in the system).
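The effect can be sketched with a toy, single-threaded model. This is only
an illustration under stated assumptions: `toy_rwsem`, `toy_try_read()` and
`toy_blocked_exits()` are hypothetical names, the lock is a plain flag
rather than the kernel's percpu rw-semaphore, and one exit attempt per loop
step is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a reader/writer lock; NOT the kernel's implementation. */
struct toy_rwsem {
	bool writer_held;
};

/* A reader can only get in while no writer holds the lock. */
static bool toy_try_read(const struct toy_rwsem *sem)
{
	return !sem->writer_held;
}

/*
 * Simulate one attach pass over n_tasks tasks while one task tries to
 * exit at every step. Returns how many exit attempts found the lock
 * held by the writer.
 */
static size_t toy_blocked_exits(size_t n_tasks, bool exit_takes_lock)
{
	struct toy_rwsem sem = { .writer_held = true };	/* writer enters */
	size_t blocked = 0;

	for (size_t i = 0; i < n_tasks; i++) {
		/* ...the writer would allocate context data for task i... */
		if (exit_takes_lock && !toy_try_read(&sem))
			blocked++;
	}

	sem.writer_held = false;			/* writer leaves */
	assert(toy_try_read(&sem));
	return blocked;
}
```

In this model the number of stalled exits grows with the number of tasks
when the exit path takes the lock, and drops to zero when it does not,
which is the shape of the change the patch makes.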

### 2. CODE CHANGE ANALYSIS

The patch makes three coordinated changes:

#### Change 1: Skip exiting tasks in `attach_global_ctx_data()` (hunk
at line 5328 in the diff)

```c
for_each_process_thread(g, p) {
	if (p->flags & PF_EXITING)
		continue;
```

This adds a check to skip tasks that are already exiting during the
global iteration. No point allocating context data for a task that's
about to die.

#### Change 2: Detect and undo allocation for exiting tasks in
`attach_task_ctx_data()` (hunk at line 5280 in the diff)

After successfully attaching via `try_cmpxchg`, the code now checks:
```c
if (task->flags & PF_EXITING) {
	/* detach_task_ctx_data() may free it already */
	if (try_cmpxchg(&task->perf_ctx_data, &cd, NULL))
		perf_free_ctx_data_rcu(cd);
}
```

This handles the race where `attach_global_ctx_data()` allocates for a
task that starts exiting between the `PF_EXITING` check and the
`try_cmpxchg`. If we detect the task is exiting, we undo our allocation.

The key insight: the `try_cmpxchg()` in `attach_task_ctx_data()` pairs
with the `try_cmpxchg()` in `detach_task_ctx_data()`, so both sides agree
on the order of updates to `task->perf_ctx_data`. If the attach-side
cmpxchg races with `perf_event_exit_task()`, it must then observe
`PF_EXITING` and attempt the undo. If `detach_task_ctx_data()` has not
yet cleared the pointer, the undo cmpxchg succeeds and frees `cd`. If
`detach_task_ctx_data()` has already cleared and freed it, the undo
cmpxchg fails (because `cd` is no longer at `task->perf_ctx_data`),
which is fine.
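A userspace sketch of this pairing, using C11 atomics in place of the
kernel's `try_cmpxchg()`. Everything prefixed `model_` or `MODEL_` is
hypothetical. The model omits RCU, reference counting and the retry loop,
and simply keeps any data that is already installed, so it shows only the
shape of the protocol, not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PF_EXITING 0x4u	/* stand-in for the kernel's PF_EXITING */

struct model_ctx_data {
	int global;
};

struct model_task {
	atomic_uint flags;
	_Atomic(struct model_ctx_data *) ctx_data;
};

static int model_frees;		/* counts frees so callers can check them */

static void model_free(struct model_ctx_data *cd)
{
	free(cd);
	model_frees++;
}

/* Install new data, then undo the install if the task is exiting. */
static int model_attach(struct model_task *t)
{
	struct model_ctx_data *cd = calloc(1, sizeof(*cd));
	struct model_ctx_data *old = NULL;

	if (!cd)
		return -1;

	if (!atomic_compare_exchange_strong(&t->ctx_data, &old, cd)) {
		free(cd);	/* simplification: keep the existing data */
		return 0;
	}

	if (atomic_load(&t->flags) & MODEL_PF_EXITING) {
		struct model_ctx_data *expected = cd;

		/* the detach side may already have cleared and freed it */
		if (atomic_compare_exchange_strong(&t->ctx_data, &expected, NULL))
			model_free(cd);
	}
	return 0;
}

/* Exit side: clear the pointer; free only if our cmpxchg won. */
static void model_detach(struct model_task *t)
{
	struct model_ctx_data *cd = atomic_load(&t->ctx_data);

	if (cd && atomic_compare_exchange_strong(&t->ctx_data, &cd, NULL))
		model_free(cd);
}
```

Whichever side wins the cmpxchg on the pointer is the one that frees, so
in this model the data is released exactly once.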

#### Change 3: Remove lock from `perf_event_exit_task()` (hunk at line
14294 in the diff)

The critical change:
```c
// BEFORE:
guard(percpu_read)(&global_ctx_data_rwsem);
detach_task_ctx_data(task);

// AFTER (no lock):
detach_task_ctx_data(task);
```

The comment explains the correctness:
> Done without holding global_ctx_data_rwsem; typically
> attach_global_ctx_data() will skip over this task, but otherwise
> attach_task_ctx_data() will observe PF_EXITING.

**Correctness argument**:
- `PF_EXITING` is set in `exit_signals()` (line 913 of exit.c)
**before** `perf_event_exit_task()` is called (line 951)
- The `try_cmpxchg()` operations provide atomic visibility of
`task->perf_ctx_data` changes
- If `attach_global_ctx_data()` races with exit: either it sees
`PF_EXITING` and skips, or if it allocates, `attach_task_ctx_data()`
sees `PF_EXITING` after the cmpxchg and undoes the allocation
- `detach_task_ctx_data()` uses `try_cmpxchg` to atomically clear the
pointer, so concurrent operations are safe
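The argument can be checked mechanically on a small abstract model. The
sketch below is an assumption-laden simplification: it treats every step
as sequentially consistent, models one attach racing with one exit, and
ignores RCU and reference counts. The names `sim`, `attach_step()`,
`exit_step()` and `check_all_interleavings()` are hypothetical. It
enumerates every interleaving of the attach-side and exit-side steps and
asserts that the pointer ends up NULL and every allocation is freed.

```c
#include <assert.h>
#include <stdbool.h>

enum { ATTACH_STEPS = 4, EXIT_STEPS = 3, TOTAL_STEPS = ATTACH_STEPS + EXIT_STEPS };

struct sim {
	bool exiting;		/* models PF_EXITING */
	int ptr;		/* models task->perf_ctx_data: 0 = NULL, 1 = cd */
	int allocs;
	int frees;
	bool attach_done;	/* attach side returned early */
	int exit_seen;		/* pointer value read by the detach side */
};

static void attach_step(struct sim *s, int step)
{
	if (s->attach_done)
		return;
	switch (step) {
	case 0:	/* global loop: skip tasks that are already exiting */
		if (s->exiting)
			s->attach_done = true;
		break;
	case 1:	/* allocate and install: cmpxchg(NULL -> cd) */
		s->allocs++;
		s->ptr = 1;
		break;
	case 2:	/* re-check the exiting flag after the install */
		if (!s->exiting)
			s->attach_done = true;
		break;
	case 3:	/* undo: cmpxchg(cd -> NULL), free only on success */
		if (s->ptr == 1) {
			s->ptr = 0;
			s->frees++;
		}
		break;
	}
}

static void exit_step(struct sim *s, int step)
{
	switch (step) {
	case 0:	/* the exiting flag is set before perf exit runs */
		s->exiting = true;
		break;
	case 1:	/* detach: read the pointer */
		s->exit_seen = s->ptr;
		break;
	case 2:	/* detach: cmpxchg(seen -> NULL), free only on success */
		if (s->exit_seen == 1 && s->ptr == 1) {
			s->ptr = 0;
			s->frees++;
		}
		break;
	}
}

static int count_bits(unsigned v)
{
	int n = 0;

	for (; v; v >>= 1)
		n += (int)(v & 1u);
	return n;
}

/* Bits set in mask mark the positions taken by exit-side steps. */
static int check_all_interleavings(void)
{
	int checked = 0;

	for (unsigned mask = 0; mask < (1u << TOTAL_STEPS); mask++) {
		struct sim s = {0};
		int a = 0, e = 0;

		if (count_bits(mask) != EXIT_STEPS)
			continue;
		for (int i = 0; i < TOTAL_STEPS; i++) {
			if (mask & (1u << i))
				exit_step(&s, e++);
			else
				attach_step(&s, a++);
		}
		assert(s.ptr == 0);		/* nothing left attached */
		assert(s.allocs == s.frees);	/* no leak, no double free */
		checked++;
	}
	return checked;
}
```

With four attach steps and three exit steps there are 35 interleavings,
and all of them pass in this model. That supports the reasoning above but
says nothing about weaker memory orderings or the parts of the kernel code
the model leaves out.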

### 3. BUG CLASSIFICATION

This is a **performance regression / system hang** fix. The
`global_ctx_data_rwsem` write lock blocks ALL readers (task exits) while
iterating ALL threads. On systems with thousands of threads:
- Opening a perf event with LBR callstacks causes the write lock to be
held for a long time
- Every task trying to exit during this period blocks on the read lock
- Under memory pressure, blocked task exits compound the problem (tasks
holding memory can't release it)
- This can effectively hang the system

### 4. SCOPE AND RISK ASSESSMENT

**Lines changed**: 20 (18 insertions, 2 deletions) in a single file
(`kernel/events/core.c`)
**Files touched**: 1
**Complexity**: Moderate - the synchronization relies on cmpxchg +
PF_EXITING flag ordering
**Risk**: LOW-MEDIUM
- The fix is self-contained within the perf subsystem
- The fix replaces lock-based synchronization on the exit path with a
lockless cmpxchg scheme, which is subtler but well-reasoned
- Peter Zijlstra (the maintainer) both suggested and signed off on the
approach
- The worst case if the fix has a subtle race: a small memory leak of
one `perf_ctx_data` allocation (not a crash)

### 5. USER IMPACT

**Who is affected**: Anyone using perf with LBR callstacks (Intel) in
system-wide mode on systems with many threads. This is common on:
- Large servers doing production profiling
- CI/CD systems running perf monitoring
- Google's production fleet (where the bug was reported)

**Severity**: HIGH - can block the task exit path, which is critical.
Under memory pressure this can effectively hang the system.

### 6. STABILITY & TRUST INDICATORS

- **Reported-by**: Rosalie Fang from Google - real production issue
- **Suggested-by**: Peter Zijlstra (perf maintainer) - expert-designed
solution
- **Signed-off-by**: Both Namhyung Kim (author) and Peter Zijlstra
(maintainer)
- **Link**: Has a patch.msgid.link reference showing proper mailing list
review

### 7. DEPENDENCY CHECK

The affected code was introduced in:
- `cb43691293390` "perf: Save PMU specific data in task_struct"
(v6.15-rc1)
- `506e64e710ff9` "perf: attach/detach PMU specific data" (v6.15-rc1)

**Applicable stable trees**: 6.15.y, 6.16.y, 6.17.y, 6.18.y
**NOT applicable**: 6.14.y and earlier (code doesn't exist)

The patch is self-contained - it doesn't depend on any other commits. It
modifies only the code introduced by the above commits and the logic is
self-sufficient.

### 8. MINOR CLEANUP CONCERN

The patch also removes a `(struct perf_ctx_data **)` cast from one
`try_cmpxchg` call in `attach_task_ctx_data()`. This is a minor type
cleanup, not a whitespace change, and should not affect functionality:
`try_cmpxchg` on an `__rcu` pointer works the same way.

### CONCLUSION

This commit fixes a real, reported, production-impacting performance bug
that can cause task exit to hang on systems with many threads when LBR
callstack profiling is active. The fix is:

1. **Obviously correct**: The synchronization relies on the well-
established pattern of `PF_EXITING` flag + atomic cmpxchg operations,
with clear ordering guarantees
2. **Fixes a real bug**: Reported by a Google engineer on production
systems
3. **Important**: Can cause system hangs, especially under memory
pressure
4. **Small and contained**: 20 changed lines in a single file, touching
only the perf subsystem
5. **No new features**: Just removes unnecessary lock contention
6. **Expert-reviewed**: Suggested by Peter Zijlstra, the perf subsystem
maintainer

The fix should be backported to all stable trees containing the affected
code (6.15.y through 6.18.y).

**YES**

kernel/events/core.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8cca800946248..69c56cad88a89 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5280,9 +5280,20 @@ attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache,
 		return -ENOMEM;

 	for (;;) {
-		if (try_cmpxchg((struct perf_ctx_data **)&task->perf_ctx_data, &old, cd)) {
+		if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) {
 			if (old)
 				perf_free_ctx_data_rcu(old);
+			/*
+			 * Above try_cmpxchg() pairs with try_cmpxchg() from
+			 * detach_task_ctx_data() such that
+			 * if we race with perf_event_exit_task(), we must
+			 * observe PF_EXITING.
+			 */
+			if (task->flags & PF_EXITING) {
+				/* detach_task_ctx_data() may free it already */
+				if (try_cmpxchg(&task->perf_ctx_data, &cd, NULL))
+					perf_free_ctx_data_rcu(cd);
+			}
 			return 0;
 		}

@@ -5328,6 +5339,8 @@ attach_global_ctx_data(struct kmem_cache *ctx_cache)
 	/* Allocate everything */
 	scoped_guard (rcu) {
 		for_each_process_thread(g, p) {
+			if (p->flags & PF_EXITING)
+				continue;
 			cd = rcu_dereference(p->perf_ctx_data);
 			if (cd && !cd->global) {
 				cd->global = 1;
@@ -14294,8 +14307,11 @@ void perf_event_exit_task(struct task_struct *task)

 	/*
 	 * Detach the perf_ctx_data for the system-wide event.
+	 *
+	 * Done without holding global_ctx_data_rwsem; typically
+	 * attach_global_ctx_data() will skip over this task, but otherwise
+	 * attach_task_ctx_data() will observe PF_EXITING.
 	 */
-	guard(percpu_read)(&global_ctx_data_rwsem);
 	detach_task_ctx_data(task);
 }

--
2.51.0