Re: [PATCH 19/19] perf: Make perf_pmu_unregister() useable
From: Peter Zijlstra
Date: Fri Jan 17 2025 - 08:04:45 EST
On Fri, Jan 17, 2025 at 01:03:16AM +0100, Peter Zijlstra wrote:
> > 2) A race with perf_event_release_kernel(). perf_event_release_kernel()
> > prepares a separate "free_list" of all children events under ctx->mutex
> > and event->child_mutex. However, the "free_list" uses the same
> > "event->child_list" for entries. OTOH, perf_pmu_unregister() ultimately
> > calls __perf_remove_from_context() with DETACH_CHILD, which checks if
> > the event being removed is a child event, and if so, it will try to
> > detach the child from parent using list_del_init(&event->child_list);
> > i.e. two code paths doing list_del on the same list entry.
> >
> >  perf_event_release_kernel()                                         perf_pmu_unregister()
> >    /* Move children events to free_list */                             ...
> >    list_for_each_entry_safe(child, tmp, &free_list, child_list) {      perf_remove_from_context() /* with DETACH_CHILD */
> >      ...                                                                 __perf_remove_from_context()
> >      list_del(&child->child_list);                                         perf_child_detach()
> >                                                                              list_del_init(&event->child_list);
>
> Bah, I had figured it was taken care of, because perf_event_exit_event()
> has a similar race. I'll try and figure out what to do there.

So, the problem appears to be that perf_event_release_kernel() does not
use DETACH_CHILD; doing so will clear PERF_ATTACH_CHILD, at which point
the above is fully serialized by parent->child_mutex.
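
To illustrate the serialization argument, here is a minimal userspace
sketch, not the kernel code. It assumes a flag-guarded detach: whichever
path runs first under the parent's lock clears the flag and unlinks the
entry, and the second path sees the flag cleared and does nothing. All
names (ATTACH_CHILD, child_detach(), child_mutex_held, the list helpers)
are hypothetical stand-ins, and the lock is modelled as a plain bool so
the example stays single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

static bool list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

static void list_push(struct list_node *n, struct list_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink and reinitialise, in the spirit of list_del_init(). */
static void list_unlink_init(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

#define ATTACH_CHILD 0x1u /* hypothetical stand-in for PERF_ATTACH_CHILD */

struct parent {
	bool child_mutex_held; /* models holding parent->child_mutex */
	struct list_node child_list;
};

struct child {
	unsigned int attach_state;
	struct parent *parent;
	struct list_node child_list;
};

/* Unlink the child at most once; returns true if this call did the unlink. */
static bool child_detach(struct child *c)
{
	assert(c->parent->child_mutex_held); /* caller must hold the lock */
	if (!(c->attach_state & ATTACH_CHILD))
		return false; /* the other path already detached it */
	c->attach_state &= ~ATTACH_CHILD;
	list_unlink_init(&c->child_list);
	return true;
}

/* Both teardown paths call child_detach(); count how many really unlink. */
static int demo_detach_twice(void)
{
	struct parent p = { .child_mutex_held = false };
	struct child c = { .attach_state = ATTACH_CHILD, .parent = &p };
	int unlinked = 0;

	list_init(&p.child_list);
	list_push(&c.child_list, &p.child_list);

	p.child_mutex_held = true;    /* "lock" */
	unlinked += child_detach(&c); /* first path, e.g. release */
	unlinked += child_detach(&c); /* second path, e.g. unregister */
	p.child_mutex_held = false;   /* "unlock" */

	assert(list_is_empty(&p.child_list));
	return unlinked;
}
```

In this sketch only one of the two calls touches the list entry, which is
the property the flag plus the mutex are meant to provide.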

Then the next problem is that since pmu_detach_events() can hold an
extra ref on things, the free_event() from free_list will WARN, like
before.

Easily fixed by making that put_event(), except that messes up the whole
wait_var_event() scheme -- since __free_event() does the final
put_ctx().

This in turn can be fixed by pushing that wake_up_var() nonsense into
put_ctx() itself.
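
As a rough userspace sketch of that idea (again, not the kernel code):
if the notification lives in the put function, the wake-up happens on
the final reference drop no matter which holder drops last, so callers
no longer need to pair every put with their own wake-up. The names below
(struct ctx, ctx_put(), the tombstone flag, the wakeups counter) are
hypothetical; the counter stands in for wake_up_var(), and the fence is
only an analogue of smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ctx {
	atomic_int refcount;
	bool tombstone; /* stand-in for ctx->task == TASK_TOMBSTONE */
	int wakeups;    /* counts notifications; stand-in for wake_up_var() */
};

static void ctx_init(struct ctx *c, int refs, bool tombstone)
{
	atomic_init(&c->refcount, refs);
	c->tombstone = tombstone;
	c->wakeups = 0;
}

/* Drop one reference; returns true if it was the final one. */
static bool ctx_put(struct ctx *c)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&c->refcount, 1) != 1)
		return false;

	if (c->tombstone) {
		/* Order the refcount update before the notification. */
		atomic_thread_fence(memory_order_seq_cst);
		c->wakeups++;
	}
	return true;
}

/* Several holders drop their references; only the last one notifies. */
static int demo_final_put_wakes(int holders)
{
	struct ctx c;

	ctx_init(&c, holders, true);
	for (int i = 0; i < holders; i++) {
		bool last = ctx_put(&c);

		assert(last == (i == holders - 1));
	}
	return c.wakeups;
}
```

The sketch never frees the object; in the patch the context is freed
through call_rcu() after the wake-up, which this example does not model.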

Which then gives me something like so.

But also, I think we can get rid of that free_list entirely.

Anyway, let me go break this up into individual patches and go test
this -- after lunch!

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1229,8 +1229,14 @@ static void put_ctx(struct perf_event_co
 	if (refcount_dec_and_test(&ctx->refcount)) {
 		if (ctx->parent_ctx)
 			put_ctx(ctx->parent_ctx);
-		if (ctx->task && ctx->task != TASK_TOMBSTONE)
-			put_task_struct(ctx->task);
+		if (ctx->task) {
+			if (ctx->task == TASK_TOMBSTONE) {
+				smp_mb();
+				wake_up_var(&ctx->refcount);
+			} else {
+				put_task_struct(ctx->task);
+			}
+		}
 		call_rcu(&ctx->rcu_head, free_ctx);
 	}
 }
@@ -5550,8 +5556,6 @@ int perf_event_release_kernel(struct per
 again:
 	mutex_lock(&event->child_mutex);
 	list_for_each_entry(child, &event->child_list, child_list) {
-		void *var = NULL;
-
 		/*
 		 * Cannot change, child events are not migrated, see the
 		 * comment with perf_event_ctx_lock_nested().
@@ -5584,46 +5588,32 @@ int perf_event_release_kernel(struct per
 		tmp = list_first_entry_or_null(&event->child_list,
 					       struct perf_event, child_list);
 		if (tmp == child) {
-			perf_remove_from_context(child, DETACH_GROUP);
-			list_move(&child->child_list, &free_list);
+			perf_remove_from_context(child, DETACH_GROUP | DETACH_CHILD);
+			/*
+			 * Can't risk calling into free_event() here, since
+			 * event->destroy() might invert with the currently
+			 * held locks, see 82d94856fa22 ("perf/core: Fix lock
+			 * inversion between perf,trace,cpuhp")
+			 */
+			list_add(&child->child_list, &free_list);
 			/*
 			 * This matches the refcount bump in inherit_event();
 			 * this can't be the last reference.
 			 */
 			put_event(event);
-		} else {
-			var = &ctx->refcount;
 		}
 		mutex_unlock(&event->child_mutex);
 		mutex_unlock(&ctx->mutex);
 		put_ctx(ctx);
-		if (var) {
-			/*
-			 * If perf_event_free_task() has deleted all events from the
-			 * ctx while the child_mutex got released above, make sure to
-			 * notify about the preceding put_ctx().
-			 */
-			smp_mb(); /* pairs with wait_var_event() */
-			wake_up_var(var);
-		}
 		goto again;
 	}
 	mutex_unlock(&event->child_mutex);
 	list_for_each_entry_safe(child, tmp, &free_list, child_list) {
-		void *var = &child->ctx->refcount;
-
 		list_del(&child->child_list);
-		free_event(child);
-
-		/*
-		 * Wake any perf_event_free_task() waiting for this event to be
-		 * freed.
-		 */
-		smp_mb(); /* pairs with wait_var_event() */
-		wake_up_var(var);
+		put_event(child);
 	}
 no_ctx: