Re: Getting empty callchain from perf_callchain_kernel()

From: Kairui Song
Date: Sun May 19 2019 - 14:09:25 EST


On Fri, May 17, 2019 at 5:10 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, May 17, 2019 at 04:15:39PM +0800, Kairui Song wrote:
> > Hi, I think the actual problem is that bpf_get_stackid_tp (and maybe
> > some other bpf functions) is now broken, or, starting an unwind
> > directly inside a bpf program will end up strangely. It has the
> > following kernel message:
>
> Urgh, what is that bpf_get_stackid_tp() doing to get the regs? I can't
> follow.

bpf_get_stackid_tp will just use the regs passed to it from the trace
point, and then it will eventually call perf_get_callchain to get the
call chain.
With a tracepoint we have fake regs, so the unwinder will start from
where it is called, use the fake regs as the indicator of the target
frame it wants, and keep unwinding until it reaches the actual call
site.

But if the stack trace is started within a bpf func call then it's broken...

If the unwinder could trace back through the bpf func call then there
will be no such problem.

For the frame pointer unwinder, setting the indicator flag
(X86_EFLAGS_FIXED) before the bpf call, and ensuring bp is also dumped,
could fix it (so it will start using the regs for bpf calls, like
before commit d15d356887e7). But for ORC I don't see a clear way to fix
the problem. A first thought is to dump some of the callee's regs for
ORC (IP, BP, SP, DI, DX, R10, R13, anything else?) in the tracepoint
handler, then use the flag to tell ORC to do one more unwind (because
we can't get the caller's regs, so we use the callee's regs instead)
before actually producing output?

I had a try: for the frame pointer unwinder, mark the indicator flag
before calling bpf functions, and dump bp as well in the tracepoint.
Then with frame pointers it works, test passed:

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 1392d5e6e8d6..6f1192e9776b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -302,12 +302,25 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);

#include <asm/stacktrace.h>

+#ifdef CONFIG_FRAME_POINTER
+static inline unsigned long caller_frame_pointer(void)
+{
+ return (unsigned long)__builtin_frame_address(1);
+}
+#else
+static inline unsigned long caller_frame_pointer(void)
+{
+ return 0;
+}
+#endif
+
/*
* We abuse bit 3 from flags to pass exact information, see perf_misc_flags
* and the comment with PERF_EFLAGS_EXACT.
*/
#define perf_arch_fetch_caller_regs(regs, __ip) { \
(regs)->ip = (__ip); \
+ (regs)->bp = caller_frame_pointer(); \
(regs)->sp = (unsigned long)__builtin_frame_address(0); \
(regs)->cs = __KERNEL_CS; \
regs->flags = 0; \
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3b96c2..ca7b95ee74f0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8549,6 +8549,7 @@ void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
struct task_struct *task)
{
if (bpf_prog_array_valid(call)) {
+ regs->flags |= X86_EFLAGS_FIXED;
*(struct pt_regs **)raw_data = regs;
if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
perf_swevent_put_recursion_context(rctx);
@@ -8822,6 +8823,8 @@ static void bpf_overflow_handler(struct perf_event *event,
int ret = 0;

ctx.regs = perf_arch_bpf_user_pt_regs(regs);
+ ctx.regs->flags |= X86_EFLAGS_FIXED;
+
preempt_disable();
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1))
goto out;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..e1fa656677dc 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -497,6 +497,8 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
};

perf_fetch_caller_regs(regs);
+ regs->flags |= X86_EFLAGS_FIXED;
+
perf_sample_data_init(sd, 0, 0);
sd->raw = &raw;

@@ -831,6 +833,8 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

perf_fetch_caller_regs(regs);
+ regs->flags |= X86_EFLAGS_FIXED;
+
return ____bpf_perf_event_output(regs, map, flags, data, size);
}

@@ -851,6 +855,8 @@ BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args,
struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

perf_fetch_caller_regs(regs);
+ regs->flags |= X86_EFLAGS_FIXED;
+
/* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */
return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
flags, 0, 0);
@@ -871,6 +877,8 @@ BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args,
struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);

perf_fetch_caller_regs(regs);
+ regs->flags |= X86_EFLAGS_FIXED;
+
return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
(unsigned long) size, flags, 0);
}

And the *_raw_tp functions fetch the regs by themselves, so that's a
bit troublesome...

----------

Another approach is to make the unwinder's direct unwinding work when
called by bpf (if possible and reasonable). I also did a nasty hacky
experiment (posted below) to just force the frame pointer unwinder's
get_stack_info to pass for bpf, and then the problem is fixed without
any other workaround:

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 753b8cfe8b8a..c0cfdf25f5ed 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -166,7 +166,8 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
if (in_entry_stack(stack, info))
goto recursion_check;

- goto unknown;
+ goto recursion_check;

recursion_check:
/*

I don't know how to do this the right way, or whether it is even
possible for all unwinders yet...

--
Best Regards,
Kairui Song