Re: Getting empty callchain from perf_callchain_kernel()
From: Song Liu
Date: Mon May 20 2019 - 13:19:52 EST
> On May 19, 2019, at 11:06 AM, Kairui Song <kasong@xxxxxxxxxx> wrote:
>
> On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>
>> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
>>> Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
>>> some other bpf functions) is now broken, or rather, starting an unwind
>>> directly inside a bpf program ends up going wrong. It produces the
>>> following kernel message:
>>
>> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
>> follow.
>
> bpf_get_stackid_tp just uses the regs passed to it from the
> tracepoint, and then it eventually calls perf_get_callchain to get the
> call chain.
> With a tracepoint we have fake regs, so the unwinder starts from
> where it is called, uses the fake regs to identify the target frame
> it wants, and keeps unwinding until it reaches the actual callsite.
>
> But if the stack trace is started within a bpf func call then it's broken...
>
> If the unwinder could trace back through the bpf func call then there
> will be no such problem.
>
> For the frame pointer unwinder, setting the indicator flag
> (X86_EFLAGS_FIXED) before the bpf call and ensuring bp is also dumped
> could fix it (so it will start using the regs for bpf calls, as before
> commit d15d356887e7). But for ORC I don't see a clear way to fix the
> problem. My first thought is to dump some of the callee's regs for ORC
> (IP, BP, SP, DI, DX, R10, R13, anything else?) in the tracepoint
> handler, then use the flag to tell ORC to do one more unwind (because
> we can't get the caller's regs, so we get the callee's regs instead)
> before actually producing output?
>
> I had a try: for the frame pointer unwinder, mark the indicator flag
> before calling bpf functions, and dump bp as well in the tracepoint.
> Then with frame pointers it works, and the test passes:
>
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 1392d5e6e8d6..6f1192e9776b 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
>
> #include <asm/stacktrace.h>
>
> +#ifdef CONFIG_FRAME_POINTER
> +static inline unsigned long caller_frame_pointer(void)
> +{
> + return (unsigned long)__builtin_frame_address(1);
> +}
> +#else
> +static inline unsigned long caller_frame_pointer(void)
> +{
> + return 0;
> +}
> +#endif
> +
> /*
> * We abuse bit 3 from flags to pass exact information, see perf_misc_flags
> * and the comment with PERF_EFLAGS_EXACT.
> */
> #define perf_arch_fetch_caller_regs(regs, __ip) { \
> (regs)->ip = (__ip); \
> + (regs)->bp = caller_frame_pointer(); \
> (regs)->sp = (unsigned long)__builtin_frame_address(0); \
> (regs)->cs = __KERNEL_CS; \
> regs->flags = 0; \
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index abbd4b3b96c2..ca7b95ee74f0 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
> struct task_struct *task)
> {
> if (bpf_prog_array_valid(call)) {
> + regs->flags |= X86_EFLAGS_FIXED;
> *(struct pt_regs **)raw_data = regs;
> if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
> perf_swevent_put_recursion_context(rctx);
> @@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event,
> int ret = 0;
>
> ctx.regs = perf_arch_bpf_user_pt_regs(regs);
> + ctx.regs->flags |= X86_EFLAGS_FIXED;
> +
> preempt_disable();
> if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
> goto out;
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index f92d6ad5e080..e1fa656677dc 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
> };
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> perf_sample_data_init(sd, 0, 0);
> sd->raw = &raw;
>
> @@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
> struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> return ____bpf_perf_event_output(regs, map, flags, data, size);
> }
>
> @@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args,
> struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> /* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */
> return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
> flags, 0, 0);
> @@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args,
> struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
>
> perf_fetch_caller_regs(regs);
> + regs->flags |= X86_EFLAGS_FIXED;
> +
> return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
> (unsigned long) size, flags, 0);
> }
>
> And the *_raw_tp functions fetch the regs by themselves, so it's a bit
> troublesome...
>
> ----------
>
> Another approach is to make the unwinder's direct unwinding work when
> called by bpf (if possible and reasonable). I also did a nasty, hacky
> experiment (posted below) that just forces the frame pointer
> unwinder's get_stack_info check to pass for bpf; then the problem is
> fixed without any other workaround:
>
> diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
> index 753b8cfe8b8a..c0cfdf25f5ed 100644
> --- a/arch/x86/kernel/dumpstack_64.c
> +++ b/arch/x86/kernel/dumpstack_64.c
> @@ -166,7 +166,8 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
> if (in_entry_stack(stack, info))
> goto recursion_check;
>
> - goto unknown;
> + goto recursion_check;
>
> recursion_check:
> /*
>
> I don't know how to do it the right way, or whether it is even
> possible for all unwinders yet...
>
> --
> Best Regards,
> Kairui Song