Re: [PATCH v3 bpf-next 1/4] tracing/probe: Add PERF_EVENT_IOC_QUERY_PROBE ioctl

From: Yonghong Song
Date: Wed Aug 21 2019 - 14:44:11 EST




On 8/21/19 11:31 AM, Peter Zijlstra wrote:
> On Wed, Aug 21, 2019 at 04:54:47PM +0000, Yonghong Song wrote:
>> Currently, in kernel/trace/bpf_trace.c, we have
>>
>> unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
>> {
>> 	unsigned int ret;
>>
>> 	if (in_nmi()) /* not supported yet */
>> 		return 1;
>>
>> 	preempt_disable();
>>
>> 	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
>
> Yes, I'm aware of that.
>
>> In the above, events with a bpf program attached will be missed
>> if the context is an nmi, or if recursion happens, whether with
>> the same or a different bpf program.
>> In case of recursion, the events will not be sent to the ring buffer.
>
> And while that is significantly worse than what ftrace/perf have, it is
> fundamentally the same thing.
>
> perf allows (and iirc ftrace does too) 4 nested contexts per CPU
> (task,softirq,irq,nmi) but any recursion within those contexts and we
> drop stuff.
>
> The BPF stuff is just more eager to drop things on the floor, but it is
> fundamentally the same.
>
>> A lot of bpf-based tracing programs use maps to communicate and
>> do not allocate a ring buffer at all.
>
> So extending PERF_RECORD_LOST doesn't work. But PERF_FORMAT_LOST might
> still work fine; but you get to implement it for all software events.

Could you give more specifics about PERF_FORMAT_LOST? Googling
"PERF_FORMAT_LOST" only yields the two emails in this thread :-(

>
>> Maybe we can still use an ioctl-based approach, which is lightweight
>> compared to the ring buffer approach? If an fd has bpf attached, nhit/nmisses
>> tell whether the kprobe was processed by the bpf program or not.
>
> There is nothing kprobe specific here. Kprobes just appear to be the
> only one actually accounting the recursion cases, but everyone has
> them.

Sorry, I should have been clearer. kprobe is just an example; I am
actually referring to any perf event that bpf can attach to, which in
theory is any perf event that can be opened with the "perf_event_open"
syscall, although some of them (e.g., software events?) may not have
bpf running hooks yet.

>
>> Currently, for debugfs, the nhit/nmisses info is exposed at
>> {k|u}probe_profile. Alternatively, we could expose the nhit/nmisses
>> in /proc/self/fdinfo/<fd>. Users can query this interface to
>> get the numbers.
>
> No, we're not adding stuff to procfs for this.

No problem. Just a suggestion.
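
To make the comparison above concrete, here is a rough userspace sketch of
the two drop policies being discussed: one recursion flag per context level
(task, softirq, irq, nmi), as Peter describes for perf, versus a single
active counter that also rejects nmi, which mirrors the shape of the
trace_call_bpf() snippet. This is only a model under stated assumptions.
None of the names below (perf_enter, bpf_enter, struct counters,
replay_perf, replay_bpf) are kernel APIs, and the miss counting is
hypothetical, since the quoted kernel code does not account misses itself.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model of the four nesting levels mentioned in the thread. */
enum ctx { CTX_TASK, CTX_SOFTIRQ, CTX_IRQ, CTX_NMI, CTX_MAX };

/* Hypothetical per-event accounting: handled vs. dropped events. */
struct counters {
	unsigned long nhit;
	unsigned long nmissed;
};

static int perf_busy[CTX_MAX];	/* perf-like: one recursion flag per level */
static int bpf_active;		/* bpf-like: one counter for all levels */

/* perf-like policy: drop only on recursion within the same level. */
static int perf_enter(enum ctx c, struct counters *s)
{
	if (perf_busy[c]) {
		s->nmissed++;
		return 0;
	}
	perf_busy[c] = 1;
	s->nhit++;
	return 1;
}

static void perf_exit(enum ctx c)
{
	assert(perf_busy[c]);
	perf_busy[c] = 0;
}

/* bpf-like policy: drop in nmi, and drop whenever anything is active. */
static int bpf_enter(enum ctx c, struct counters *s)
{
	if (c == CTX_NMI) {
		s->nmissed++;
		return 0;
	}
	if (++bpf_active != 1) {
		bpf_active--;	/* undo the increment, as the kernel path does */
		s->nmissed++;
		return 0;
	}
	s->nhit++;
	return 1;
}

static void bpf_exit(void)
{
	assert(bpf_active > 0);
	bpf_active--;
}

/* Replay: task event, same-level recursion, then irq and nmi on top. */
struct counters replay_perf(void)
{
	struct counters s = { 0, 0 };
	int task = perf_enter(CTX_TASK, &s);
	(void)perf_enter(CTX_TASK, &s);	/* recursion: dropped */
	int irq = perf_enter(CTX_IRQ, &s);
	int nmi = perf_enter(CTX_NMI, &s);

	if (nmi)
		perf_exit(CTX_NMI);
	if (irq)
		perf_exit(CTX_IRQ);
	if (task)
		perf_exit(CTX_TASK);
	return s;
}

struct counters replay_bpf(void)
{
	struct counters s = { 0, 0 };
	int task = bpf_enter(CTX_TASK, &s);
	(void)bpf_enter(CTX_TASK, &s);	/* recursion: dropped */
	int irq = bpf_enter(CTX_IRQ, &s);	/* dropped: task still active */
	(void)bpf_enter(CTX_NMI, &s);	/* dropped: nmi not supported */

	if (irq)
		bpf_exit();
	if (task)
		bpf_exit();
	return s;
}

void print_counters(const char *name, struct counters s)
{
	printf("%s: nhit=%lu nmissed=%lu\n", name, s.nhit, s.nmissed);
}
```

Calling print_counters("perf", replay_perf()) and
print_counters("bpf", replay_bpf()) on the same four-event sequence shows
the perf-like policy handling three events and dropping one, while the
bpf-like policy handles one and drops three. The mechanism is the same in
both; only the eagerness to drop differs.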