Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work.

From: Sebastian Andrzej Siewior
Date: Tue Apr 09 2024 - 04:57:49 EST


On 2024-04-08 23:29:03 [+0200], Frederic Weisbecker wrote:
> > index c7a0274c662c8..e0b2da8de485f 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
> > state = PERF_EVENT_STATE_OFF;
> > }
> >
> > - if (event->pending_sigtrap) {
> > - bool dec = true;
> > -
> > - event->pending_sigtrap = 0;
> > - if (state != PERF_EVENT_STATE_OFF &&
> > - !event->pending_work) {
> > - event->pending_work = 1;
> > - dec = false;
> > - WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > - task_work_add(current, &event->pending_task, TWA_RESUME);
> > - }
> > - if (dec)
> > - local_dec(&event->ctx->nr_pending);
> > - }
> > -
> > perf_event_set_state(event, state);
> >
> > if (!is_software_event(event))
> > @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event)
> > * Yay, we hit home and are in the context of the event.
> > */
> > if (cpu == smp_processor_id()) {
> > - if (event->pending_sigtrap) {
> > - event->pending_sigtrap = 0;
> > - perf_sigtrap(event);
> > - local_dec(&event->ctx->nr_pending);
> > - }
> > if (event->pending_disable) {
> > event->pending_disable = 0;
> > perf_event_disable_local(event);
> > @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event,
> >
> > if (regs)
> > pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1;
> > - if (!event->pending_sigtrap) {
> > - event->pending_sigtrap = pending_id;
> > + if (!event->pending_work) {
> > + event->pending_work = pending_id;
> > local_inc(&event->ctx->nr_pending);
> > - irq_work_queue(&event->pending_irq);
> > + WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount));
> > + task_work_add(current, &event->pending_task, TWA_RESUME);
>
> If the overflow happens between exit_task_work() and perf_event_exit_task(),
> you're leaking the event. (This was there before this patch).
> See:
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@xxxxxxxxx/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39

Okay.
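
If I read the exit path right, the problematic window is roughly this
part of do_exit() (a sketch from memory, surrounding calls elided):

	exit_task_work(tsk);		/* task_works is sealed, any further
					 * task_work_add() returns -ESRCH */
	...
	perf_event_exit_task(tsk);

An overflow in between still takes the reference via
atomic_long_inc_not_zero() but the unchecked task_work_add() fails, so
nothing ever drops that reference again.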

> > + /*
> > + * The NMI path returns directly to userland. The
> > + * irq_work is raised as a dummy interrupt to ensure
> > + * regular return path to user is taken and task_work
> > + * is processed.
> > + */
> > + if (in_nmi())
> > + irq_work_queue(&event->pending_irq);
> > } else if (event->attr.exclude_kernel && valid_sample) {
> > /*
> > * Should not be able to return to user space without
> > - * consuming pending_sigtrap; with exceptions:
> > + * consuming pending_work; with exceptions:
> > *
> > * 1. Where !exclude_kernel, events can overflow again
> > * in the kernel without returning to user space.
> > @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event,
> > * To approximate progress (with false negatives),
> > * check 32-bit hash of the current IP.
> > */
> > - WARN_ON_ONCE(event->pending_sigtrap != pending_id);
> > + WARN_ON_ONCE(event->pending_work != pending_id);
> > }
> >
> > event->pending_addr = 0;
> > @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event)
> > &parent_event->child_total_time_running);
> > }
> >
> > +static bool task_work_cb_match(struct callback_head *cb, void *data)
> > +{
> > + struct perf_event *event = container_of(cb, struct perf_event, pending_task);
> > +
> > + return event == data;
> > +}
>
> I suggest we introduce a proper API to cancel an actual callback head, see:
>
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@xxxxxxxxx/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@xxxxxxxxx/T/#m0a347249a462523358724085f2489ce9ed91e640

This rework would work.
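
With the task_work_cancel() rework from those links, the call site in
perf_event_exit_event() could shrink to something like this (just a
sketch; the final name/signature of the helper taking the callback_head
directly is up to that series):

	if (event->pending_work &&
	    task_work_cancel(current, &event->pending_task)) {
		put_event(event);
		local_dec(&event->ctx->nr_pending);
	}

and task_work_cb_match() goes away.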

> > static void
> > perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > {
> > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
> > * Kick perf_poll() for is_event_hup();
> > */
> > perf_event_wakeup(parent_event);
> > + /*
> > + * Cancel pending task_work and update counters if it has not
> > + * yet been delivered to userland. free_event() expects the
> > + * reference counter at one and keeping the event around until
> > + * the task returns to userland can be unexpected if there is
> > + * no signal handler registered.
> > + */
> > + if (event->pending_work &&
> > + task_work_cancel_match(current, task_work_cb_match, event)) {
> > + put_event(event);
> > + local_dec(&event->ctx->nr_pending);
> > + }
>
> So exiting task, privileged exec and also exit on exec call into this before
> releasing the children.
>
> And parents rely on put_event() from file close + the task work.
>
> But what about remote release of children on file close?
> See perf_event_release_kernel() directly calling free_event() on them.

That is an interesting point. I had events popping up at random even
after the task had decided that it won't return to userland to handle
them, so freeing the event looked like the only option…

> One possible fix is to avoid the reference count game around task work
> and flush them on free_event().
>
> See here:
>
> https://lore.kernel.org/all/202403310406.TPrIela8-lkp@xxxxxxxxx/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50

That wake_up() within a preempt_disable() section breaks on RT.
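
To illustrate the pattern (not the exact hunk from the link, wq is just
a placeholder wait queue): on PREEMPT_RT a spinlock_t is a sleeping
rtmutex-based lock, and wake_up() takes the wait queue's spinlock_t, so

	preempt_disable();
	wake_up(&wq);		/* may sleep on RT */
	preempt_enable();

triggers the usual "sleeping function called from invalid context"
splat.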

How do we go on from here?

> Thanks.

Sebastian