Re: [tip: perf/core] perf: Simplify perf_event_free_task() wait

From: Frederic Weisbecker
Date: Wed Apr 09 2025 - 09:01:47 EST


Le Tue, Apr 08, 2025 at 07:05:04PM -0000, tip-bot2 for Peter Zijlstra a écrit :
> The following commit has been merged into the perf/core branch of tip:
>
> Commit-ID: 59f3aa4a3ee27e96132e16d2d2bdc3acadb4bf79
> Gitweb: https://git.kernel.org/tip/59f3aa4a3ee27e96132e16d2d2bdc3acadb4bf79
> Author: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> AuthorDate: Fri, 17 Jan 2025 15:27:07 +01:00
> Committer: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CommitterDate: Tue, 08 Apr 2025 20:55:46 +02:00
>
> perf: Simplify perf_event_free_task() wait
>
> Simplify the code by moving the duplicated wakeup condition into
> put_ctx().
>
> Notably, wait_var_event() is in perf_event_free_task() and will have
> set ctx->task = TASK_TOMBSTONE.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Reviewed-by: Ravi Bangoria <ravi.bangoria@xxxxxxx>
> Link: https://lkml.kernel.org/r/20250307193723.044499344@xxxxxxxxxxxxx
> ---
> kernel/events/core.c | 25 +++----------------------
> 1 file changed, 3 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 3c92b75..fa6dab0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1270,6 +1270,9 @@ static void put_ctx(struct perf_event_context *ctx)
> if (ctx->task && ctx->task != TASK_TOMBSTONE)
> put_task_struct(ctx->task);
> call_rcu(&ctx->rcu_head, free_ctx);
> + } else if (ctx->task == TASK_TOMBSTONE) {
> + smp_mb(); /* pairs with wait_var_event() */
> + wake_up_var(&ctx->refcount);

So there are three situations:

* If perf_event_free_task() has removed all the children from the parent list
before perf_event_release_kernel() got a chance to even iterate them, then
it's all good as there is no get_ctx() pending.

* If perf_event_release_kernel() iterates a child event, but it gets freed
meanwhile by perf_event_free_task() while the mutexes are temporarily
unlocked, it's all good because while locking again the ctx mutex,
perf_event_release_kernel() observes TASK_TOMBSTONE.

* But if perf_event_release_kernel() frees the child event before
perf_event_free_task() got a chance, we may face this scenario:

perf_event_release_kernel() perf_event_free_task()
-------------------------- ------------------------
mutex_lock(&event->child_mutex)
get_ctx(child->ctx)
mutex_unlock(&event->child_mutex)

mutex_lock(ctx->mutex)
mutex_lock(&event->child_mutex)
perf_remove_from_context(child)
mutex_unlock(&event->child_mutex)
mutex_unlock(ctx->mutex)

// This lock acquires ctx->refcount == 2
// visibility
mutex_lock(ctx->mutex)
ctx->task = TASK_TOMBSTONE
mutex_unlock(ctx->mutex)

wait_var_event()
// enters prepare_to_wait() since
// ctx->refcount == 2
// is guaranteed to be seen
set_current_state(TASK_INTERRUPTIBLE)
smp_mb()
if (ctx->refcount != 1)
schedule()
put_ctx()
// NOT fully ordered! Only RELEASE semantics
refcount_dec_and_test()
atomic_fetch_sub_release()
// So TASK_TOMBSTONE is not guaranteed to be seen
if (ctx->task == TASK_TOMBSTONE)
wake_up_var()

Basically it's a broken store buffer:

perf_event_release_kernel() perf_event_free_task()
-------------------------- ------------------------
ctx->task = TASK_TOMBSTONE smp_store_release(&ctx->refcount, ctx->refcount - 1)
smp_mb()
READ_ONCE(ctx->refcount) READ_ONCE(ctx->task)


So you need this:

diff --git a/kernel/events/core.c b/kernel/events/core.c
index fa6dab08be47..c4fbbe25361a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1270,9 +1270,10 @@ static void put_ctx(struct perf_event_context *ctx)
if (ctx->task && ctx->task != TASK_TOMBSTONE)
put_task_struct(ctx->task);
call_rcu(&ctx->rcu_head, free_ctx);
- } else if (ctx->task == TASK_TOMBSTONE) {
+ } else {
smp_mb(); /* pairs with wait_var_event() */
- wake_up_var(&ctx->refcount);
+ if (ctx->task == TASK_TOMBSTONE)
+ wake_up_var(&ctx->refcount);
}
}



--
Frederic Weisbecker
SUSE Labs