On 2021/6/4 18:50, Qi Liu wrote:
This patch introduces optprobe for ARM64. In optprobe, the probed
instruction is replaced by a branch instruction to a detour
buffer. The detour buffer contains trampoline code and a call to
optimized_callback(). optimized_callback() calls opt_pre_handler()
to execute the kprobe handler.
Limitations:
- We only support the !CONFIG_RANDOMIZE_MODULE_REGION_FULL case, to
guarantee that the offset between the probe point and the kprobe
pre_handler is not larger than 128MiB.
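
The 128MiB figure matches the reach of a single AArch64 B
instruction, whose 26-bit signed word offset covers [-128MiB, +128MiB).
A minimal user-space sketch of that range check and encoding follows.
It is not the code from the patch: branch_in_range() and
encode_branch() are hypothetical helper names used only for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* AArch64 B (unconditional branch) opcode; the low 26 bits hold imm26. */
#define A64_B_OPCODE   0x14000000u
#define A64_IMM26_MASK 0x03ffffffu
/* imm26 is a signed word offset, so the byte range is [-128MiB, +128MiB). */
#define A64_B_RANGE    ((int64_t)128 * 1024 * 1024)

/* Hypothetical helper: can a single B placed at 'from' reach 'to'? */
static bool branch_in_range(uint64_t from, uint64_t to)
{
	int64_t offset = (int64_t)(to - from);

	if (offset & 3)		/* A64 instructions are 4-byte aligned */
		return false;
	return offset >= -A64_B_RANGE && offset < A64_B_RANGE;
}

/*
 * Hypothetical helper: encode "B to" for an instruction at 'from'.
 * Returns 0 (which is not a B encoding) when the target is unreachable.
 */
static uint32_t encode_branch(uint64_t from, uint64_t to)
{
	uint64_t offset = to - from;

	if (!branch_in_range(from, to))
		return 0;
	/* Bits [27:2] of the byte offset become imm26. */
	return A64_B_OPCODE | (uint32_t)((offset >> 2) & A64_IMM26_MASK);
}
```

With this sketch, a probe point and a detour buffer that end up more
than 128MiB apart make encode_branch() return 0, which is the
situation the limitation above is meant to rule out.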
Performance of optprobe on the Hip08 platform was tested using the
kprobe example module[1] to analyze the latency of a kernel function,
and here is the result:
+ Jean-Philippe Brucker as well.
I assume both Jean and Robin expressed interest in having
an optprobe solution on ARM64 in a previous discussion
when I tried to add some tracepoints for debugging:
"[PATCH] iommu/arm-smmu-v3: add tracepoints for cmdq_issue_cmdlist"
https://lore.kernel.org/linux-arm-kernel/20200828083325.GC3825485@myrica/
https://lore.kernel.org/linux-arm-kernel/9acf1acf-19fb-26db-e908-eb4d4c666bae@xxxxxxx/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c
kprobe before optimization:
[280709.846380] do_empty returned 0 and took 1530 ns to execute
[280709.852057] do_empty returned 0 and took 550 ns to execute
[280709.857631] do_empty returned 0 and took 440 ns to execute
[280709.863215] do_empty returned 0 and took 380 ns to execute
[280709.868787] do_empty returned 0 and took 360 ns to execute
[280709.874362] do_empty returned 0 and took 340 ns to execute
[280709.879936] do_empty returned 0 and took 320 ns to execute
[280709.885505] do_empty returned 0 and took 300 ns to execute
[280709.891075] do_empty returned 0 and took 280 ns to execute
[280709.896646] do_empty returned 0 and took 290 ns to execute
[280709.902220] do_empty returned 0 and took 290 ns to execute
[280709.907807] do_empty returned 0 and took 290 ns to execute
I used to see the same phenomenon when I used kprobe to debug
the arm64 SMMU driver. The first time a kprobe was executed, it
was crazily slow. The second time it was much faster, though
still slow enough to negatively affect performance-related
debugging.
Not sure if it was due to a hot cache or something. I didn't dig
into it.