Re: [PATCH v2] perf: Synchronously cleanup child events

From: Alexei Starovoitov
Date: Wed Jan 20 2016 - 23:56:12 EST

On Wed, Jan 20, 2016 at 09:32:22AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 19, 2016 at 01:58:19PM -0800, Alexei Starovoitov wrote:
> > On Tue, Jan 19, 2016 at 09:05:58PM +0100, Peter Zijlstra wrote:
> > > The most obvious place that generates such magical references would be
> > > the bpf arraymap doing perf_event_get() on things. There are a few other
> > > places that take temp references (perf_mmap_close), but those are
> > > 'short' lived and while ugly will not cause massive grief. The BPF one
> > > OTOH is a real problem here.
> > >
> > > And looking at the BPF stuff, that code seems to assume
> > > perf_event_kernel_release() := put_event(), so this patch breaks that
> > > too.
> > >
> > >
> > > Alexei, is there a reason the arraymap stuff needs a perf event ref as
> > > opposed to a file ref? I'm forever a little confused on how perf<->bpf
> > > works.
> >
> > A file ref will not work, since user space could have closed that
> > perf_event file to avoid unnecessary FDs.
> So I'm (possibly again) confused on how BPF works.
> I thought the reason you handed in perf events from userspace; as
> opposed to creating your own with perf_event_create_kernel_counter();
> was because userspace was interested in the output.

yes. There are two use cases of perf_events from bpf:
1. sw_bpf_output event is used by bpf to push samples into it and
user spaces reads it as normal via mmap
2. PERF_TYPE_HARDWARE event is used by bpf program to read
counters to measure things like number of cycles or tlb misses
in a given function.
In this case user space typically leaves FDs around, but it doesn't
use them for anything.

> Also, BPF should not be a way to get around the filedesc resource limit.

all bpf tracing stuff is root only and maps are charged for every element.

> > Program only need the stable pointer to 'struct perf_event' which
> > it will use while running.
> > At the end it will call perf_event_kernel_release() which
> > is == put_event().
> > It was the case that 'perf_events' were normal refcnt-ed structures
> > and the last guy frees it.
> Sort-of, but user events are (or should be, rather) tied to the filedesc
> to account the resources used.
> There is also the event->owner field, we track the task that created the
> event, with your current scheme that is left dangling once userspace
> closes the last filedesc and you still have a ref open.
> > This put_event_last() logic definitely looks problematic.
> > There are no ordering guarantees.
> > User space may close FD, while struct perf_event is still alive.
> > The loop around perf_event_last() looks buggy.
> > I'm obviously missing the main goal of this patch.
> Right, so the patch in question tries to synchronously clean up
> everything related to the counter when we close the file. Such that the
> file better reflects the actual resource usage.
> Currently we do this async (and with holes).
> In short, user created event really should be filedesc based, yes we
> have event references, but those 'should' be short lived.

I'm still missing why it's the problem.
Which counter do you want to bump as part of perf_event_get() ?
still event->refcount, right?
but the same perf_event can be stored in multiple bpf maps
and many bpf programs can be using it, while nothing can
possibly prevent the user space to do close(perf_event_fd)
while programs are still running and collecting tlb miss data
from the counters.
So what do you propose?