Re: [PATCH 1/8] perf/kprobe: Add support to create multiple probes

From: Jiri Olsa
Date: Sun Nov 28 2021 - 17:36:32 EST


On Sun, Nov 28, 2021 at 10:49:54PM +0900, Masami Hiramatsu wrote:
> On Wed, 24 Nov 2021 09:41:12 +0100
> Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
>
> > Adding support to create multiple probes within single perf event.
> > This way we can associate single bpf program with multiple kprobes,
> > because bpf program gets associated with the perf event.
> >
> > The perf_event_attr is not extended, current fields for kprobe
> > attachment are used for multi attachment.
> >
> > For current kprobe atachment we use either:
> >
> > kprobe_func (in config1) + probe_offset (in config2)
> >
> > to define kprobe by function name with offset, or:
> >
> > kprobe_addr (in config2)
> >
> > to define kprobe with direct address value.
> >
> > For multi probe attach the same fields point to array of values
> > with the same semantic. Each probe is defined as set of values
> > with the same array index (idx) as:
> >
> > kprobe_func[idx] + probe_offset[idx]
> >
> > to define kprobe by function name with offset, or:
> >
> > kprobe_addr[idx]
> >
> > to define kprobe with direct address value.
> >
> > The number of probes is passed in probe_cnt value, which shares
> > the union with wakeup_events/wakeup_watermark values which are
> > not used for kprobes.
> >
> > Since [1] it's possible to stack multiple probes events under
> > one head event. Using the same code to allow that for probes
> > defined under perf kprobe interface.
>
> OK, so you also want to add multi-probes on single event by
> single perf-event syscall. Not defining different events.

correct.. bpf program is then attached to perf event with
ioctl call.. this way we can have multiple probes attached
to single bpf program

>
> Those are bit different, multi-probes on single event can
> invoke single event handler from different probe points. For
> exapmple same user bpf handler will be invoked from different
> address.

right, that's the goal, having single bpf program executed
from multiple probes

>
> >
> > [1] https://lore.kernel.org/lkml/156095682948.28024.14190188071338900568.stgit@devnote2/
> > Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> > ---
> > include/uapi/linux/perf_event.h | 1 +
> > kernel/trace/trace_event_perf.c | 106 ++++++++++++++++++++++++++++----
> > kernel/trace/trace_kprobe.c | 47 ++++++++++++--
> > kernel/trace/trace_probe.c | 2 +-
> > kernel/trace/trace_probe.h | 3 +-
> > 5 files changed, 138 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index bd8860eeb291..eea80709d1ed 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -414,6 +414,7 @@ struct perf_event_attr {
> > union {
> > __u32 wakeup_events; /* wakeup every n events */
> > __u32 wakeup_watermark; /* bytes before wakeup */
> > + __u32 probe_cnt; /* number of [k,u] probes */
> > };
> >
> > __u32 bp_type;
> > diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
> > index a114549720d6..26078e40c299 100644
> > --- a/kernel/trace/trace_event_perf.c
> > +++ b/kernel/trace/trace_event_perf.c
> > @@ -245,23 +245,27 @@ void perf_trace_destroy(struct perf_event *p_event)
> > }
> >
> > #ifdef CONFIG_KPROBE_EVENTS
> > -int perf_kprobe_init(struct perf_event *p_event, bool is_retprobe)
> > +static struct trace_event_call*
> > +kprobe_init(bool is_retprobe, u64 kprobe_func, u64 kprobe_addr,
> > + u64 probe_offset, struct trace_event_call *old)
> > {
> > int ret;
> > char *func = NULL;
> > struct trace_event_call *tp_event;
> >
> > - if (p_event->attr.kprobe_func) {
> > + if (kprobe_func) {
> > func = kzalloc(KSYM_NAME_LEN, GFP_KERNEL);
> > if (!func)
> > - return -ENOMEM;
> > + return ERR_PTR(-ENOMEM);
> > ret = strncpy_from_user(
> > - func, u64_to_user_ptr(p_event->attr.kprobe_func),
> > + func, u64_to_user_ptr(kprobe_func),
> > KSYM_NAME_LEN);
> > if (ret == KSYM_NAME_LEN)
> > ret = -E2BIG;
> > - if (ret < 0)
> > - goto out;
> > + if (ret < 0) {
> > + kfree(func);
> > + return ERR_PTR(ret);
> > + }
> >
> > if (func[0] == '\0') {
> > kfree(func);
> > @@ -270,20 +274,96 @@ int perf_kprobe_init(struct perf_event *p_event, bool is_retprobe)
> > }
> >
> > tp_event = create_local_trace_kprobe(
> > - func, (void *)(unsigned long)(p_event->attr.kprobe_addr),
> > - p_event->attr.probe_offset, is_retprobe);
> > - if (IS_ERR(tp_event)) {
> > - ret = PTR_ERR(tp_event);
> > - goto out;
> > + func, (void *)(unsigned long)(kprobe_addr),
> > + probe_offset, is_retprobe, old);
>
> Hmm, here I have a concern (maybe no real issue is caused at this momemnt.)
> Since ftrace's multi-probe event has same event/group name among the
> probes's internal event-calls. However, create_local_trace_kprobe()
> actually uses the "func" name for the event name.
> I think you should choose a randome different "representative" event name
> for the event (not probe), and share it among the probes on the event,
> if the perf event has no event name.
>
> (I'm not sure how the event names are used from inside of the BPF programs,
> but just for the consistency.)

ok, I don't think event names are used, I'll check

>
> > + kfree(func);
> > + return tp_event;
> > +}
> > +
> > +static struct trace_event_call*
> > +kprobe_init_multi(struct perf_event *p_event, bool is_retprobe)
> > +{
> > + void __user *kprobe_func = u64_to_user_ptr(p_event->attr.kprobe_func);
> > + void __user *kprobe_addr = u64_to_user_ptr(p_event->attr.kprobe_addr);
> > + u64 *funcs = NULL, *addrs = NULL, *offs = NULL;
> > + struct trace_event_call *tp_event, *tp_old = NULL;
> > + u32 i, cnt = p_event->attr.probe_cnt;
> > + int ret = -EINVAL;
> > + size_t size;
> > +
> > + if (!cnt)
> > + return ERR_PTR(-EINVAL);
> > +
> > + size = cnt * sizeof(u64);
> > + if (kprobe_func) {
> > + ret = -ENOMEM;
> > + funcs = kmalloc(size, GFP_KERNEL);
> > + if (!funcs)
> > + goto out;
> > + ret = -EFAULT;
> > + if (copy_from_user(funcs, kprobe_func, size))
> > + goto out;
> > + }
> > +
> > + if (kprobe_addr) {
> > + ret = -ENOMEM;
> > + addrs = kmalloc(size, GFP_KERNEL);
> > + if (!addrs)
> > + goto out;
> > + ret = -EFAULT;
> > + if (copy_from_user(addrs, kprobe_addr, size))
> > + goto out;
> > + /* addresses and ofsets share the same array */
> > + offs = addrs;
> > }
> >
> > + for (i = 0; i < cnt; i++) {
> > + tp_event = kprobe_init(is_retprobe, funcs ? funcs[i] : 0,
> > + addrs ? addrs[i] : 0, offs ? offs[i] : 0,
> > + tp_old);
> > + if (IS_ERR(tp_event)) {
> > + if (tp_old)
> > + destroy_local_trace_kprobe(tp_old);
> > + ret = PTR_ERR(tp_event);
> > + goto out;
> > + }
> > + if (!tp_old)
> > + tp_old = tp_event;
> > + }
> > + ret = 0;
> > +out:
> > + kfree(funcs);
> > + kfree(addrs);
> > + return ret ? ERR_PTR(ret) : tp_old;
> > +}
> > +
> > +static struct trace_event_call*
> > +kprobe_init_single(struct perf_event *p_event, bool is_retprobe)
> > +{
> > + struct perf_event_attr *attr = &p_event->attr;
> > +
> > + return kprobe_init(is_retprobe, attr->kprobe_func, attr->kprobe_addr,
> > + attr->probe_offset, NULL);
> > +}
> > +
> > +int perf_kprobe_init(struct perf_event *p_event, bool is_retprobe)
> > +{
> > + struct trace_event_call *tp_event;
> > + int ret;
> > +
> > + if (p_event->attr.probe_cnt)
>
> isn't this "p_event->attr.probe_cnt > 1" ?

right, probe_cnt is just added by this patchset and used only when
there are multiple probes being attached, so that's why it works
even with 1 at the moment

will change

>
> > + tp_event = kprobe_init_multi(p_event, is_retprobe);
> > + else
> > + tp_event = kprobe_init_single(p_event, is_retprobe);
> > +
> > + if (IS_ERR(tp_event))
> > + return PTR_ERR(tp_event);
> > +
> > mutex_lock(&event_mutex);
> > ret = perf_trace_event_init(tp_event, p_event);
> > if (ret)
> > destroy_local_trace_kprobe(tp_event);
> > mutex_unlock(&event_mutex);
> > -out:
> > - kfree(func);
> > return ret;
> > }
> >
> > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> > index 33272a7b6912..86a7aada853a 100644
> > --- a/kernel/trace/trace_kprobe.c
> > +++ b/kernel/trace/trace_kprobe.c
> > @@ -237,13 +237,18 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs);
> > static int kretprobe_dispatcher(struct kretprobe_instance *ri,
> > struct pt_regs *regs);
> >
> > +static void __free_trace_kprobe(struct trace_kprobe *tk)
> > +{
> > + kfree(tk->symbol);
> > + free_percpu(tk->nhit);
> > + kfree(tk);
> > +}
> > +
> > static void free_trace_kprobe(struct trace_kprobe *tk)
> > {
> > if (tk) {
> > trace_probe_cleanup(&tk->tp);
> > - kfree(tk->symbol);
> > - free_percpu(tk->nhit);
> > - kfree(tk);
> > + __free_trace_kprobe(tk);
>
> Why is this needed?

I needed some free function that does not call trace_probe_cleanup and
trace_probe_unlink.. I wrote details in the destroy_local_trace_kprobe
comment below

>
> > }
> > }
> >
> > @@ -1796,7 +1801,7 @@ static int unregister_kprobe_event(struct trace_kprobe *tk)
> > /* create a trace_kprobe, but don't add it to global lists */
> > struct trace_event_call *
> > create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
> > - bool is_return)
> > + bool is_return, struct trace_event_call *old)
> > {
> > enum probe_print_type ptype;
> > struct trace_kprobe *tk;
> > @@ -1820,6 +1825,28 @@ create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
> > return ERR_CAST(tk);
> > }
> >
> > + if (old) {
> > + struct trace_kprobe *tk_old;
> > +
> > + tk_old = trace_kprobe_primary_from_call(old);
>
> So, this will choose the first(primary) probe's function name as
> the representative event name. But other probes can have different
> event names.

ok

>
> > + if (!tk_old) {
> > + ret = -EINVAL;
> > + goto error;
> > + }
> > +
> > + /* Append to existing event */
> > + ret = trace_probe_append(&tk->tp, &tk_old->tp);
> > + if (ret)
> > + goto error;
> > +
> > + /* Register k*probe */
> > + ret = __register_trace_kprobe(tk);
> > + if (ret)
> > + goto error;
>
> If "appended" probe failed to register, it must be "unlinked" from
> the first one and goto error to free the trace_kprobe.
>
> if (ret) {
> trace_probe_unlink(&tk->tp);
> goto error;
> }
>
> See append_trace_kprobe() for details.

so there's goto error jumping to:

error:
free_trace_kprobe(tk);

that calls:
trace_probe_cleanup
-> trace_probe_unlink

that should do it, right?

>
> > +
> > + return trace_probe_event_call(&tk->tp);
> > + }
> > +
> > init_trace_event_call(tk);
> >
> > ptype = trace_kprobe_is_return(tk) ?
> > @@ -1841,6 +1868,8 @@ create_local_trace_kprobe(char *func, void *addr, unsigned long offs,
> >
> > void destroy_local_trace_kprobe(struct trace_event_call *event_call)
> > {
> > + struct trace_probe_event *event;
> > + struct trace_probe *pos, *tmp;
> > struct trace_kprobe *tk;
> >
> > tk = trace_kprobe_primary_from_call(event_call);
> > @@ -1852,9 +1881,15 @@ void destroy_local_trace_kprobe(struct trace_event_call *event_call)
> > return;
> > }
> >
> > - __unregister_trace_kprobe(tk);
> > + event = tk->tp.event;
> > + list_for_each_entry_safe(pos, tmp, &event->probes, list) {
> > + list_for_each_entry_safe(pos, tmp, &event->probes, list) {
> > + list_del_init(&pos->list);
> > + __unregister_trace_kprobe(tk);
> > + __free_trace_kprobe(tk);
> > + }
> >
> > - free_trace_kprobe(tk);
> > + trace_probe_event_free(event);
>
> Actually, each probe already allocated the trace_probe events (which are not
> used if it is appended). Thus you have to use trace_probe_unlink(&tk->tp) in
> the above loop.
>
> list_for_each_entry_safe(pos, tmp, &event->probes, list) {
> list_for_each_entry_safe(pos, tmp, &event->probes, list) {
> __unregister_trace_kprobe(tk);
> trace_probe_unlink(&tk->tp); /* This will call trace_probe_event_free() internally */
> free_trace_kprobe(tk);
> }

so calling trace_probe_event_free inside this loop is a problem,
because the loop iterates that trace_probe_event's probes list,
and last probe removed will trigger trace_probe_event_free, that
will free the list we iterate.. and we go down ;-)

so that's why I added new free function '__free_trace_kprobe'
that frees everything as free_trace_kprobe, but does not call
trace_probe_unlink

event = tk->tp.event;
list_for_each_entry_safe(pos, tmp, &event->probes, list) {
list_for_each_entry_safe(pos, tmp, &event->probes, list) {
list_del_init(&pos->list);
__unregister_trace_kprobe(tk);
__free_trace_kprobe(tk);
}

trace_probe_event_free(event);

and there's trace_probe_event_free(event) to make the final free

thanks,
jirka