Re: [PATCH 3/4] perf: Fix event leak upon exit

From: Peter Zijlstra
Date: Thu May 16 2024 - 05:05:52 EST


On Wed, May 15, 2024 at 04:43:10PM +0200, Frederic Weisbecker wrote:
> When a task is scheduled out, pending sigtrap deliveries are deferred
> to the target task upon resume to userspace via task_work.
>
> However failures while adding en event's callback to the task_work
> engine are ignored. And since the last call for events exit happen
> after task work is eventually closed, there is a small window during
> which pending sigtrap can be queued though ignored, leaking the event
> refcount addition such as in the following scenario:
>
> TASK A
> -----
>
> do_exit()
> exit_task_work(tsk);
>
> <IRQ>
> perf_event_overflow()
> event->pending_sigtrap = pending_id;
> irq_work_queue(&event->pending_irq);
> </IRQ>
> =========> PREEMPTION: TASK A -> TASK B
> event_sched_out()
> event->pending_sigtrap = 0;
> atomic_long_inc_not_zero(&event->refcount)
> // FAILS: task work has exited
> task_work_add(&event->pending_task)
> [...]
> <IRQ WORK>
> perf_pending_irq()
> // early return: event->oncpu = -1
> </IRQ WORK>
> [...]
> =========> TASK B -> TASK A
> perf_event_exit_task(tsk)
> perf_event_exit_event()
> free_event()
> WARN(atomic_long_cmpxchg(&event->refcount, 1, 0) != 1)
> // leak event due to unexpected refcount == 2
>
> As a result the event is never released while the task exits.

Urgh...

>
> Fix this with appropriate task_work_add()'s error handling.
>
> Fixes: 517e6a301f34 ("perf: Fix perf_pending_task() UaF")
> Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
> ---
> kernel/events/core.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 724e6d7e128f..c1632e69c69d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2289,10 +2289,11 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
> event->pending_sigtrap = 0;
> if (state != PERF_EVENT_STATE_OFF &&
> !event->pending_work) {
> - event->pending_work = 1;
> - dec = false;
> - WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> - task_work_add(current, &event->pending_task, TWA_RESUME);
> + if (task_work_add(current, &event->pending_task, TWA_RESUME) >= 0) {

AFAICT the thing is a return 0 on success -Efoo on fail, no? That is,
should this not simply be '== 0' ?

> + WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> + dec = false;
> + event->pending_work = 1;
> + }

Also, do we want to write it like so and save an indent?

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2288,11 +2288,11 @@ event_sched_out(struct perf_event *event

event->pending_sigtrap = 0;
if (state != PERF_EVENT_STATE_OFF &&
- !event->pending_work) {
+ !event->pending_work &&
+ !task_work_add(current, &event->pending_task, TWA_RESUME)) {
event->pending_work = 1;
dec = false;
WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
- task_work_add(current, &event->pending_task, TWA_RESUME);
}
if (dec)
local_dec(&event->ctx->nr_pending);