Re: [PATCH bpf-next v2 0/4] Add ftrace direct call for arm64

From: Jiri Olsa
Date: Wed Oct 05 2022 - 18:12:33 EST


On Wed, Oct 05, 2022 at 11:30:19AM -0400, Steven Rostedt wrote:
> On Wed, 5 Oct 2022 17:10:33 +0200
> Florent Revest <revest@xxxxxxxxxxxx> wrote:
>
> > On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> > >
> > > On Wed, 5 Oct 2022 22:54:15 +0800
> > > Xu Kuohai <xukuohai@xxxxxxxxxx> wrote:
> > >
> > > > 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
> > > >
> > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > 1000000+0 records in
> > > > 1000000+0 records out
> > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
> > > >
> > > >
> > > > 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
> > > >
> > > > # dd if=/dev/zero of=/dev/null count=1000000
> > > > 1000000+0 records in
> > > > 1000000+0 records out
> > > > 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
> >
> > Thanks for the measurements Xu!
> >
> > > Can you show the implementation of the indirect call you used?
> >
> > Xu used my development branch here
> > https://github.com/FlorentRevest/linux/commits/fprobe-min-args

nice :) I guess you did not try to run it on x86; I had to add some small
changes and disable HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS to get it to compile

>
> That looks like it could be optimized quite a bit too.
>
> Specifically this part:
>
> static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> {
> 	struct bpf_fprobe_call_context *call_ctx = private;
> 	struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> 	struct bpf_tramp_links *links = fprobe_ctx->links;
> 	struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> 	struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> 	struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> 	int i, ret;
>
> 	memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> 	call_ctx->ip = ip;
> 	for (i = 0; i < fprobe_ctx->nr_args; i++)
> 		call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
>
> 	for (i = 0; i < fentry->nr_links; i++)
> 		call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
>
> 	call_ctx->args[fprobe_ctx->nr_args] = 0;
> 	for (i = 0; i < fmod_ret->nr_links; i++) {
> 		ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> 				    call_ctx->args);
>
> 		if (ret) {
> 			ftrace_regs_set_return_value(regs, ret);
> 			ftrace_override_function_with_return(regs);
>
> 			bpf_fprobe_exit(fp, ip, regs, private);
> 			return false;
> 		}
> 	}
>
> 	return fexit->nr_links;
> }
>
> There's a lot of low hanging fruit to speed up there. I wouldn't be too
> fast to throw out this solution if it hasn't had the care that direct calls
> have had to speed that up.
>
> For example, trampolines currently only allow attaching to functions with 6
> parameters or fewer (3 on x86_32). You could make 7 specific callbacks, with
> zero to 6 parameters, and unroll the argument loop.
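
fwiw, a rough sketch of what such per-arity callbacks could look like,
reusing the names from the function quoted above (untested, and
bpf_fprobe_run_links() is a made-up helper standing in for the
fentry/fmod_ret/fexit handling):

static bool bpf_fprobe_entry2(struct fprobe *fp, unsigned long ip,
			      struct ftrace_regs *regs, void *private)
{
	struct bpf_fprobe_call_context *call_ctx = private;

	memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
	call_ctx->ip = ip;

	/* fixed two-argument copy, no nr_args loop on the hot path */
	call_ctx->args[0] = ftrace_regs_get_argument(regs, 0);
	call_ctx->args[1] = ftrace_regs_get_argument(regs, 1);

	/* common tail: run fentry/fmod_ret progs and decide about fexit */
	return bpf_fprobe_run_links(fp, ip, regs, call_ctx);
}

/*
 * ... plus the other six variants for 0..6 arguments, with the right
 * one picked at attach time from the BTF argument count.
 */
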
>
> Would also be interesting to run perf to see where the overhead is. There
> may be other locations to work on to make it almost as fast as direct
> callers without the other baggage.

I can boot the change and run tests in qemu, but for some reason it
won't boot on hw, so I only have the perf report from qemu so far

there's fprobe/rethook machinery showing up as expected

jirka


---
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 23K of event 'cpu-clock:k'
# Event count (approx.): 5841250000
#
# Overhead Command Shared Object Symbol
# ........ ....... .............................................. ..................................................
#
18.65% bench [kernel.kallsyms] [k] syscall_enter_from_user_mode
|
---syscall_enter_from_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

13.03% bench [kernel.kallsyms] [k] seqcount_lockdep_reader_access.constprop.0
|
---seqcount_lockdep_reader_access.constprop.0
ktime_get_coarse_real_ts64
syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

9.49% bench [kernel.kallsyms] [k] rethook_try_get
|
---rethook_try_get
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

8.71% bench [kernel.kallsyms] [k] rethook_recycle
|
---rethook_recycle
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

4.31% bench [kernel.kallsyms] [k] rcu_is_watching
|
---rcu_is_watching
|
|--1.49%--rethook_try_get
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
|--1.10%--do_getpgid
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
|--1.02%--__bpf_prog_exit
| call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--0.70%--__bpf_prog_enter
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

2.94% bench [kernel.kallsyms] [k] lock_release
|
---lock_release
|
|--1.51%--call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--1.43%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

2.91% bench bpf_prog_21856463590f61f1_bench_trigger_fentry [k] bpf_prog_21856463590f61f1_bench_trigger_fentry
|
---bpf_prog_21856463590f61f1_bench_trigger_fentry
|
--2.66%--call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

2.69% bench [kernel.kallsyms] [k] bpf_fprobe_entry
|
---bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

2.60% bench [kernel.kallsyms] [k] lock_acquire
|
---lock_acquire
|
|--1.34%--__bpf_prog_enter
| call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--1.24%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

2.42% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode_prepare
|
---syscall_exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

2.37% bench [kernel.kallsyms] [k] __audit_syscall_entry
|
---__audit_syscall_entry
syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
|
--2.36%--syscall

2.35% bench [kernel.kallsyms] [k] syscall_trace_enter.constprop.0
|
---syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

2.12% bench [kernel.kallsyms] [k] check_preemption_disabled
|
---check_preemption_disabled
|
--1.55%--rcu_is_watching
|
--0.59%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

2.00% bench [kernel.kallsyms] [k] fprobe_handler
|
---fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

1.94% bench [kernel.kallsyms] [k] local_irq_disable_exit_to_user
|
---local_irq_disable_exit_to_user
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

1.84% bench [kernel.kallsyms] [k] rcu_read_lock_sched_held
|
---rcu_read_lock_sched_held
|
|--0.93%--lock_acquire
|
--0.90%--lock_release

1.71% bench [kernel.kallsyms] [k] migrate_enable
|
---migrate_enable
__bpf_prog_exit
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

1.66% bench [kernel.kallsyms] [k] call_bpf_prog.isra.0
|
---call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

1.53% bench [kernel.kallsyms] [k] __rcu_read_unlock
|
---__rcu_read_unlock
|
|--0.86%--__bpf_prog_exit
| call_bpf_prog.isra.0
| bpf_fprobe_entry
| fprobe_handler
| ftrace_trampoline
| __x64_sys_getpgid
| do_syscall_64
| entry_SYSCALL_64_after_hwframe
| syscall
|
--0.66%--do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

1.31% bench [kernel.kallsyms] [k] debug_smp_processor_id
|
---debug_smp_processor_id
|
--0.77%--rcu_is_watching

1.22% bench [kernel.kallsyms] [k] migrate_disable
|
---migrate_disable
__bpf_prog_enter
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

1.19% bench [kernel.kallsyms] [k] __bpf_prog_enter
|
---__bpf_prog_enter
call_bpf_prog.isra.0
bpf_fprobe_entry
fprobe_handler
ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

0.84% bench [kernel.kallsyms] [k] __radix_tree_lookup
|
---__radix_tree_lookup
find_task_by_pid_ns
do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

0.82% bench [kernel.kallsyms] [k] do_getpgid
|
---do_getpgid
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

0.78% bench [kernel.kallsyms] [k] debug_lockdep_rcu_enabled
|
---debug_lockdep_rcu_enabled
|
--0.63%--rcu_read_lock_sched_held

0.74% bench ftrace_trampoline [k] ftrace_trampoline
|
---ftrace_trampoline
__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

0.72% bench [kernel.kallsyms] [k] preempt_count_add
|
---preempt_count_add

0.71% bench [kernel.kallsyms] [k] ktime_get_coarse_real_ts64
|
---ktime_get_coarse_real_ts64
syscall_trace_enter.constprop.0
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

0.69% bench [kernel.kallsyms] [k] do_syscall_64
|
---do_syscall_64
entry_SYSCALL_64_after_hwframe
|
--0.68%--syscall

0.60% bench [kernel.kallsyms] [k] preempt_count_sub
|
---preempt_count_sub

0.59% bench [kernel.kallsyms] [k] __rcu_read_lock
|
---__rcu_read_lock

0.59% bench [kernel.kallsyms] [k] __x64_sys_getpgid
|
---__x64_sys_getpgid
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

0.58% bench [kernel.kallsyms] [k] __audit_syscall_exit
|
---__audit_syscall_exit
syscall_exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

0.53% bench [kernel.kallsyms] [k] audit_reset_context
|
---audit_reset_context
syscall_exit_to_user_mode_prepare
syscall_exit_to_user_mode
do_syscall_64
entry_SYSCALL_64_after_hwframe
syscall

0.45% bench [kernel.kallsyms] [k] rcu_read_lock_held
0.36% bench [kernel.kallsyms] [k] find_task_by_vpid
0.32% bench [kernel.kallsyms] [k] __bpf_prog_exit
0.26% bench [kernel.kallsyms] [k] syscall_exit_to_user_mode
0.20% bench [kernel.kallsyms] [k] idr_find
0.18% bench [kernel.kallsyms] [k] find_task_by_pid_ns
0.17% bench [kernel.kallsyms] [k] update_prog_stats
0.16% bench [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.14% bench [kernel.kallsyms] [k] pid_task
0.04% bench [kernel.kallsyms] [k] memchr_inv
0.04% bench [kernel.kallsyms] [k] smp_call_function_many_cond
0.03% bench [kernel.kallsyms] [k] do_user_addr_fault
0.03% bench [kernel.kallsyms] [k] kallsyms_expand_symbol.constprop.0
0.03% bench [kernel.kallsyms] [k] native_flush_tlb_global
0.03% bench [kernel.kallsyms] [k] __change_page_attr_set_clr
0.02% bench [kernel.kallsyms] [k] memcpy_erms
0.02% bench [kernel.kallsyms] [k] unwind_next_frame
0.02% bench [kernel.kallsyms] [k] copy_user_enhanced_fast_string
0.01% bench [kernel.kallsyms] [k] __orc_find
0.01% bench [kernel.kallsyms] [k] call_rcu
0.01% bench [kernel.kallsyms] [k] __alloc_pages
0.01% bench [kernel.kallsyms] [k] __purge_vmap_area_lazy
0.01% bench [kernel.kallsyms] [k] __softirqentry_text_start
0.01% bench [kernel.kallsyms] [k] __stack_depot_save
0.01% bench [kernel.kallsyms] [k] __up_read
0.01% bench [kernel.kallsyms] [k] __virt_addr_valid
0.01% bench [kernel.kallsyms] [k] clear_page_erms
0.01% bench [kernel.kallsyms] [k] deactivate_slab
0.01% bench [kernel.kallsyms] [k] do_check_common
0.01% bench [kernel.kallsyms] [k] finish_task_switch.isra.0
0.01% bench [kernel.kallsyms] [k] free_unref_page_list
0.01% bench [kernel.kallsyms] [k] ftrace_rec_iter_next
0.01% bench [kernel.kallsyms] [k] handle_mm_fault
0.01% bench [kernel.kallsyms] [k] orc_find.part.0
0.01% bench [kernel.kallsyms] [k] try_charge_memcg
0.00% bench [kernel.kallsyms] [k] ___slab_alloc
0.00% bench [kernel.kallsyms] [k] __fdget_pos
0.00% bench [kernel.kallsyms] [k] __handle_mm_fault
0.00% bench [kernel.kallsyms] [k] __is_insn_slot_addr
0.00% bench [kernel.kallsyms] [k] __kmalloc
0.00% bench [kernel.kallsyms] [k] __mod_lruvec_page_state
0.00% bench [kernel.kallsyms] [k] __mod_node_page_state
0.00% bench [kernel.kallsyms] [k] __mutex_lock
0.00% bench [kernel.kallsyms] [k] __raw_spin_lock_init
0.00% bench [kernel.kallsyms] [k] alloc_vmap_area
0.00% bench [kernel.kallsyms] [k] allocate_slab
0.00% bench [kernel.kallsyms] [k] audit_get_tty
0.00% bench [kernel.kallsyms] [k] bpf_ksym_find
0.00% bench [kernel.kallsyms] [k] btf_check_all_metas
0.00% bench [kernel.kallsyms] [k] btf_put
0.00% bench [kernel.kallsyms] [k] cmpxchg_double_slab.constprop.0.isra.0
0.00% bench [kernel.kallsyms] [k] do_fault
0.00% bench [kernel.kallsyms] [k] do_raw_spin_trylock
0.00% bench [kernel.kallsyms] [k] find_vma
0.00% bench [kernel.kallsyms] [k] fs_reclaim_release
0.00% bench [kernel.kallsyms] [k] ftrace_check_record
0.00% bench [kernel.kallsyms] [k] ftrace_replace_code
0.00% bench [kernel.kallsyms] [k] get_mem_cgroup_from_mm
0.00% bench [kernel.kallsyms] [k] get_page_from_freelist
0.00% bench [kernel.kallsyms] [k] in_gate_area_no_mm
0.00% bench [kernel.kallsyms] [k] in_task_stack
0.00% bench [kernel.kallsyms] [k] kernel_text_address
0.00% bench [kernel.kallsyms] [k] kernfs_fop_read_iter
0.00% bench [kernel.kallsyms] [k] kernfs_put_active
0.00% bench [kernel.kallsyms] [k] kfree
0.00% bench [kernel.kallsyms] [k] kmem_cache_alloc
0.00% bench [kernel.kallsyms] [k] ksys_read
0.00% bench [kernel.kallsyms] [k] lookup_address_in_pgd
0.00% bench [kernel.kallsyms] [k] mlock_page_drain_local
0.00% bench [kernel.kallsyms] [k] page_remove_rmap
0.00% bench [kernel.kallsyms] [k] post_alloc_hook
0.00% bench [kernel.kallsyms] [k] preempt_schedule_irq
0.00% bench [kernel.kallsyms] [k] queue_work_on
0.00% bench [kernel.kallsyms] [k] stack_trace_save
0.00% bench [kernel.kallsyms] [k] within_error_injection_list


#
# (Tip: To record callchains for each sample: perf record -g)
#