Re: [PATCH v6] arm64: kprobe: Enable OPTPROBE for arm64

From: Qinxin Xia
Date: Wed Jan 15 2025 - 04:29:36 EST



在 2025/1/14 11:56, Yicong Yang 写道:
On 2025/1/3 9:27, Qinxin Xia wrote:
This patch introduce optprobe for ARM64. In optprobe, probed
instruction is replaced by a branch instruction to trampoline.

Performance of optprobe on Hip08 platform is test using kprobe
example module to analyze the latency of a kernel function,
and here is the result:

common kprobe:
[280709.846380] do_empty returned 0 and took 1530 ns to execute
[280709.852057] do_empty returned 0 and took 550 ns to execute
[280709.857631] do_empty returned 0 and took 440 ns to execute
[280709.863215] do_empty returned 0 and took 380 ns to execute
[280709.868787] do_empty returned 0 and took 360 ns to execute
[280709.874362] do_empty returned 0 and took 340 ns to execute
[280709.879936] do_empty returned 0 and took 320 ns to execute
[280709.885505] do_empty returned 0 and took 300 ns to execute
[280709.891075] do_empty returned 0 and took 280 ns to execute
[280709.896646] do_empty returned 0 and took 290 ns to execute

optprobe:
[ 2965.964572] do_empty returned 0 and took 90 ns to execute
[ 2965.969952] do_empty returned 0 and took 80 ns to execute
[ 2965.975332] do_empty returned 0 and took 70 ns to execute
[ 2965.980714] do_empty returned 0 and took 60 ns to execute
[ 2965.986128] do_empty returned 0 and took 80 ns to execute
[ 2965.991507] do_empty returned 0 and took 70 ns to execute
[ 2965.996884] do_empty returned 0 and took 70 ns to execute
[ 2966.002262] do_empty returned 0 and took 80 ns to execute
[ 2966.007642] do_empty returned 0 and took 70 ns to execute
[ 2966.013020] do_empty returned 0 and took 70 ns to execute
[ 2966.018400] do_empty returned 0 and took 70 ns to execute

As the result shows, optprobe can greatly reduce the latency. Big
latency of common kprobe will significantly impact the real result
while doing performance analysis or debugging performance issues
in lab, so optprobe is useful in this scenario.

Co-developed-by: Qi Liu <liuqi115@xxxxxxxxxx>
Signed-off-by: Qi Liu <liuqi115@xxxxxxxxxx>
Signed-off-by: Qinxin Xia <xiaqinxin@xxxxxxxxxx>
---

Changes since V5:
- Address the comments from Masami, saves stack frames to obtain correct backtrace
and make an array of usage flags to manage the reserved OPT_SLOT_SIZE.
- Link: https://lore.kernel.org/lkml/20211207124002.59877-1-liuqi115@xxxxxxxxxx/

Changes since V4:
- Address the comments from Masami, update arch_prepare_optimized_kprobe,
if the probe address is out of limit return -ERANGE.
- Link: https://lore.kernel.org/lkml/20210818073336.59678-1-liuqi115@xxxxxxxxxx/

Changes since V3:
- Address the comments from Masami, reduce the number of aarch64_insn_patch_text
in arch_optimize_kprobes() and arch_unoptimize_kprobes().
- Link: https://lore.kernel.org/lkml/20210810055330.18924-1-liuqi115@xxxxxxxxxx/

Changes since V2:
- Address the comments from Masami, prepare another writable buffer in
arch_prepare_optimized_kprobe()and build the trampoline code on it.
- Address the comments from Amit, move save_all_base_regs and
restore_all_base_regs to <asm/assembler.h>, as these two macros are reused
in optprobe.
- Link: https://lore.kernel.org/lkml/20210804060209.95817-1-liuqi115@xxxxxxxxxx/

Changes since V1:
- Address the comments from Masami, checks for all branch instructions, and
use aarch64_insn_patch_text_nosync() instead of aarch64_insn_patch_text()
in each probe.
- Link: https://lore.kernel.org/lkml/20210719122417.10355-1-liuqi115@xxxxxxxxxx/
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/kprobes.h | 22 ++
arch/arm64/kernel/probes/Makefile | 2 +
arch/arm64/kernel/probes/opt_arm64.c | 216 ++++++++++++++++++
.../arm64/kernel/probes/optprobe_trampoline.S | 112 +++++++++
kernel/kprobes.c | 18 ++
6 files changed, 371 insertions(+)
create mode 100644 arch/arm64/kernel/probes/opt_arm64.c
create mode 100644 arch/arm64/kernel/probes/optprobe_trampoline.S

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 100570a048c5..f9c4e2625595 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -244,6 +244,7 @@ config ARM64
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_KPROBES
select HAVE_KRETPROBES
+ select HAVE_OPTPROBES
select HAVE_GENERIC_VDSO
select HOTPLUG_CORE_SYNC_DEAD if HOTPLUG_CPU
select IRQ_DOMAIN
diff --git a/arch/arm64/include/asm/kprobes.h b/arch/arm64/include/asm/kprobes.h
index be7a3680dadf..bd4973bfb58d 100644
--- a/arch/arm64/include/asm/kprobes.h
+++ b/arch/arm64/include/asm/kprobes.h
@@ -37,6 +37,28 @@ struct kprobe_ctlblk {
void arch_remove_kprobe(struct kprobe *);
int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
+
+struct arch_optimized_insn {
+ kprobe_opcode_t orig_insn[1];
+ kprobe_opcode_t *trampoline;
+};
+
+#define MAX_OPTIMIZED_LENGTH sizeof(kprobe_opcode_t)
+#define MAX_OPTINSN_SIZE \
+ ((unsigned long)optprobe_template_end - (unsigned long)optprobe_template_entry)
+
+extern __visible kprobe_opcode_t optprobe_template_entry[];
+extern __visible kprobe_opcode_t optprobe_template_val[];
+extern __visible kprobe_opcode_t optprobe_template_orig_addr[];
+extern __visible kprobe_opcode_t optprobe_template_common[];
+extern __visible kprobe_opcode_t optprobe_template_end[];
+extern __visible kprobe_opcode_t optprobe_template_restore_begin[];
+extern __visible kprobe_opcode_t optprobe_template_restore_orig_insn[];
+extern __visible kprobe_opcode_t optprobe_template_restore_end[];
+extern __visible kprobe_opcode_t optinsn_slot[];
+
+void optprobe_common(void);
+
void __kretprobe_trampoline(void);
void __kprobes *trampoline_probe_handler(struct pt_regs *regs);
diff --git a/arch/arm64/kernel/probes/Makefile b/arch/arm64/kernel/probes/Makefile
index 8e4be92e25b1..7b2885b23ff6 100644
--- a/arch/arm64/kernel/probes/Makefile
+++ b/arch/arm64/kernel/probes/Makefile
@@ -4,3 +4,5 @@ obj-$(CONFIG_KPROBES) += kprobes.o decode-insn.o \
simulate-insn.o
obj-$(CONFIG_UPROBES) += uprobes.o decode-insn.o \
simulate-insn.o
+obj-$(CONFIG_OPTPROBES) += opt_arm64.o \
+ optprobe_trampoline.o
diff --git a/arch/arm64/kernel/probes/opt_arm64.c b/arch/arm64/kernel/probes/opt_arm64.c
new file mode 100644
index 000000000000..fa8b68afbbf3
--- /dev/null
+++ b/arch/arm64/kernel/probes/opt_arm64.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Code for Kernel probes Jump optimization.
+ *
+ * Copyright (C) 2024 Hisilicon Limited
s/Hisilicon/HiSilicon

the date needs update as well now
ok, I will fix it in the next version.

+ */
+
+#include <linux/jump_label.h>
+#include <linux/kprobes.h>
+#include <linux/wordpart.h>
+
+#include <asm/cacheflush.h>
+#include <asm/compiler.h>
+#include <asm/insn.h>
+#include <asm/kprobes.h>
+#include <asm/patching.h>
+
+#define OPTPROBE_BATCH_SIZE 64
+
+#define TMPL_VAL_IDX \
+ (optprobe_template_val - optprobe_template_entry)
+#define TMPL_ORIGN_ADDR \
+ (optprobe_template_orig_addr - optprobe_template_entry)
+#define TMPL_CALL_COMMON \
+ (optprobe_template_common - optprobe_template_entry)
+#define TMPL_RESTORE_ORIGN_INSN \
+ (optprobe_template_restore_orig_insn - optprobe_template_entry)
+#define TMPL_RESTORE_END \
+ (optprobe_template_restore_end - optprobe_template_entry)
+
+#define OPT_SLOT_SIZE 65536
+#define OPT_INSN_PAGES (OPT_SLOT_SIZE / PAGE_SIZE)
+
+static bool insn_page_in_use[OPT_INSN_PAGES];
+
+void *alloc_optinsn_page(void)
+{
+ int i;
+
+ for (i = 0; i < OPT_INSN_PAGES; i++) {
+ if (!insn_page_in_use[i]) {
+ insn_page_in_use[i] = true;
+ return (void *)((unsigned long)optinsn_slot + PAGE_SIZE * i);
+ }
+ }
+
+ return NULL;
+}
+
+void free_optinsn_page(void *page)
+{
+ unsigned long idx = (unsigned long)page - (unsigned long)optinsn_slot;
+
+ WARN_ONCE(idx & (PAGE_SIZE - 1), "Invalid idx with wrong align\n");
+ idx >>= PAGE_SHIFT;
+ if (WARN_ONCE(idx >= OPT_INSN_PAGES, "Invalid idx with wrong size\n"))
+ return;
+ insn_page_in_use[idx] = false;
+}
+
+int arch_check_optimized_kprobe(struct optimized_kprobe *op)
+{
+ return 0;
may need to add some comments to say why we don't need a check here (maybe the
similiar reason to powerpc?)
yeah, I will add comment in next version.
+}
+
+int arch_prepared_optinsn(struct arch_optimized_insn *optinsn)
+{
+ return optinsn->trampoline != NULL;
+}
+
+int arch_within_optimized_kprobe(struct optimized_kprobe *op, kprobe_opcode_t *addr)
+{
+ return op->kp.addr == addr;
+}
+
+static void optprobe_set_pc_value(struct optimized_kprobe *op, struct pt_regs *regs)
+{
+ regs->pc = (unsigned long)op->kp.addr;
+}
+
+static int optprobe_check_branch_limit(unsigned long pc, unsigned long addr)
+{
+ long offset;
+
+ if ((pc & 0x3) || (addr & 0x3))
+ return -ERANGE;
+
+ offset = (long)addr - (long)pc;
+ if (offset < -SZ_128M || offset >= SZ_128M)
+ return -ERANGE;
+
+ return 0;
+}
+
+int arch_prepare_optimized_kprobe(struct optimized_kprobe *op, struct kprobe *orig)
+{
+ kprobe_opcode_t *code, *buf;
+ int ret = -ENOMEM;
+ u32 insn;
+ int i;
+
+ buf = kzalloc(MAX_OPTINSN_SIZE, GFP_KERNEL);
+ if (!buf)
+ return ret;
+
+ code = get_optinsn_slot();
+ if (!code)
+ goto out;
+
+ if (optprobe_check_branch_limit((unsigned long)code, (unsigned long)orig->addr + 8)) {
+ ret = -ERANGE;
+ goto error;
+ }
+
+ memcpy(buf, optprobe_template_entry, MAX_OPTINSN_SIZE);
+
+ insn = aarch64_insn_gen_branch_imm((unsigned long)&code[TMPL_CALL_COMMON],
+ (unsigned long)&optprobe_common,
+ AARCH64_INSN_BRANCH_LINK);
+ buf[TMPL_CALL_COMMON] = insn;
+
+ insn = aarch64_insn_gen_branch_imm((unsigned long)&code[TMPL_RESTORE_END],
+ (unsigned long)(op->kp.addr + 1),
+ AARCH64_INSN_BRANCH_NOLINK);
we may need to check the result of aarch64_insn_gen_branch_imm(). Not all the return value
is a legal instruction.
ok, I will fix it in the next version.

+ buf[TMPL_RESTORE_END] = insn;
+
+ buf[TMPL_VAL_IDX] = cpu_to_le32(lower_32_bits((unsigned long)op));
+ buf[TMPL_VAL_IDX + 1] = cpu_to_le32(upper_32_bits((unsigned long)op));
+ buf[TMPL_ORIGN_ADDR] = cpu_to_le32(lower_32_bits((unsigned long)orig->addr));
+ buf[TMPL_ORIGN_ADDR + 1] = cpu_to_le32(upper_32_bits((unsigned long)orig->addr));
+
+ buf[TMPL_RESTORE_ORIGN_INSN] = orig->opcode;
+
+ /* Setup template */
+ for (i = 0; i < MAX_OPTINSN_SIZE / MAX_OPTIMIZED_LENGTH; i++)
+ aarch64_insn_patch_text_nosync(code + i, buf[i]);
+
+ flush_icache_range((unsigned long)code, (unsigned long)(&code[TMPL_VAL_IDX]));
+ /* Set op->optinsn.trampoline means prepared. */
+ op->optinsn.trampoline = code;
+
+ return 0;
+error:
+ free_optinsn_slot(code, 0);
+
+out:
+ kfree(buf);
+ return ret;
+}
+
+void arch_optimize_kprobes(struct list_head *oplist)
+{
+ struct optimized_kprobe *op, *tmp;
+ kprobe_opcode_t insns[OPTPROBE_BATCH_SIZE];
+ void *addrs[OPTPROBE_BATCH_SIZE];
+ int i = 0;
+
+ list_for_each_entry_safe(op, tmp, oplist, list) {
+ WARN_ON(kprobe_disabled(&op->kp));
+
+ /*
+ * Backup instructions which will be replaced
+ * by jump address
+ */
+ memcpy(op->optinsn.orig_insn, op->kp.addr, AARCH64_INSN_SIZE);
+
+ addrs[i] = op->kp.addr;
+ insns[i] = aarch64_insn_gen_branch_imm((unsigned long)op->kp.addr,
+ (unsigned long)op->optinsn.trampoline,
+ AARCH64_INSN_BRANCH_NOLINK);
+
+ list_del_init(&op->list);
+ if (++i == OPTPROBE_BATCH_SIZE)
+ break;
+ }
+
+ aarch64_insn_patch_text(addrs, insns, i);
+}
+
+void arch_unoptimize_kprobe(struct optimized_kprobe *op)
+{
+ arch_arm_kprobe(&op->kp);
+}
+
+/*
+ * Recover original instructions and breakpoints from relative jumps.
+ * Caller must call with locking kprobe_mutex.
+ */
+void arch_unoptimize_kprobes(struct list_head *oplist,
+ struct list_head *done_list)
+{
+ struct optimized_kprobe *op, *tmp;
+ kprobe_opcode_t insns[OPTPROBE_BATCH_SIZE];
+ void *addrs[OPTPROBE_BATCH_SIZE];
+ int i = 0;
+
+ list_for_each_entry_safe(op, tmp, oplist, list) {
+ addrs[i] = op->kp.addr;
+ insns[i] = BRK64_OPCODE_KPROBES;
+ list_move(&op->list, done_list);
+
+ if (++i == OPTPROBE_BATCH_SIZE)
+ break;
+ }
+
+ aarch64_insn_patch_text(addrs, insns, i);
+}
+
+void arch_remove_optimized_kprobe(struct optimized_kprobe *op)
+{
+ if (op->optinsn.trampoline) {
+ free_optinsn_slot(op->optinsn.trampoline, 1);
+ op->optinsn.trampoline = NULL;
+ }
+
+}
diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
new file mode 100644
index 000000000000..ba936b2efb5e
--- /dev/null
+++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * trampoline entry and return code for optprobes.
+ */
+
+#include <linux/linkage.h>
+#include <asm/asm-offsets.h>
+#include <asm/assembler.h>
+
+#define OPT_SLOT_SIZE 65536
+
+ .global optinsn_slot
+optinsn_slot:
+ .space OPT_SLOT_SIZE
+
+SYM_CODE_START(optprobe_common)
+ stp x2, x3, [sp, #S_X2]
+ stp x4, x5, [sp, #S_X4]
+ stp x6, x7, [sp, #S_X6]
+ stp x8, x9, [sp, #S_X8]
+ stp x10, x11, [sp, #S_X10]
+ stp x12, x13, [sp, #S_X12]
+ stp x14, x15, [sp, #S_X14]
+ stp x16, x17, [sp, #S_X16]
+ stp x18, x19, [sp, #S_X18]
+ stp x20, x21, [sp, #S_X20]
+ stp x22, x23, [sp, #S_X22]
+ stp x24, x25, [sp, #S_X24]
+ stp x26, x27, [sp, #S_X26]
+ stp x28, x29, [sp, #S_X28]
+ add x2, sp, #PT_REGS_SIZE
+ str x2, [sp, #S_SP]
+ /* Construct a useful saved PSTATE */
+ mrs x2, nzcv
+ mrs x3, daif
+ orr x2, x2, x3
+ mrs x3, CurrentEL
+ orr x2, x2, x3
+ mrs x3, SPSel
+ orr x2, x2, x3
+ ldr x1, 2f
+ stp x1, x2, [sp, #S_PC]
+
+ /* set the pt_regs address to x1 */
+ mov x1, sp
+ /* store lr of optprobe_common temporary */
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+
+ bl optprobe_optimized_callback
+
+ ldp x29, x30, [sp], #16
+
+ ldr x0, [sp, #S_PSTATE]
+ and x0, x0, #(PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | PSR_V_BIT)
+ msr nzcv, x0
+
+ ldp x0, x1, [sp, #S_X0]
+ ldp x2, x3, [sp, #S_X2]
+ ldp x4, x5, [sp, #S_X4]
+ ldp x6, x7, [sp, #S_X6]
+ ldp x8, x9, [sp, #S_X8]
+ ldp x10, x11, [sp, #S_X10]
+ ldp x12, x13, [sp, #S_X12]
+ ldp x14, x15, [sp, #S_X14]
+ ldp x16, x17, [sp, #S_X16]
+ ldp x18, x19, [sp, #S_X18]
+ ldp x20, x21, [sp, #S_X20]
+ ldp x22, x23, [sp, #S_X22]
+ ldp x24, x25, [sp, #S_X24]
+ ldp x26, x27, [sp, #S_X26]
+ ldp x28, x29, [sp, #S_X28]
+ ret
+SYM_CODE_END(optprobe_common)
+
+ .global optprobe_template_entry
+optprobe_template_entry:
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+ ldr x30, 2f
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+ sub sp, sp, #PT_REGS_SIZE
+ str lr, [sp, #S_LR]
+ stp x0, x1, [sp, #S_X0]
+ /* Get parameters to optimized_callback() */
+ ldr x0, 1f
+ .global optprobe_template_common
+optprobe_template_common:
+ nop
+ ldr lr, [sp, #S_LR]
+ add sp, sp, #PT_REGS_SIZE
+ ldp x29, x30, [sp], #16
+ ldp x29, x30, [sp], #16
+ .global optprobe_template_restore_orig_insn
+optprobe_template_restore_orig_insn:
+ nop
+ .global optprobe_template_restore_end
+optprobe_template_restore_end:
+ nop
+ .balign
+ .global optprobe_template_val
+optprobe_template_val:
+ 1: .long 0
+ .long 0
+ .balign
+ .global optprobe_template_orig_addr
+optprobe_template_orig_addr:
+ 2: .long 0
+ .long 0
+ .global optprobe_template_end
a "nop" instruction is needed around here (also similar to what powerpc does) or need other
ways to fix the symbol collision like:
git:(local) grep -B 1 optprobe_template_end System.map
ffff80008004ff40 t __secondary_switched
ffff80008004ff40 T optprobe_template_end

it'll lead to a confusing call stack like:
[...]
[ 510.939204][ C11] Call trace:
[ 510.939221][ C11] cpu_do_idle+0x10/0x68
[ 510.939299][ C11] acpi_idle_lpi_enter+0x50/0x78
[ 510.939364][ C11] cpuidle_enter_state+0x88/0x478
[ 510.939390][ C11] cpuidle_enter+0x40/0x60
[ 510.939421][ C11] cpuidle_idle_call+0x138/0x1e8
[ 510.939481][ C11] do_idle+0xac/0x118
[ 510.939534][ C11] cpu_startup_entry+0x40/0x50
[ 510.939588][ C11] secondary_start_kernel+0x14c/0x1e8
[ 510.939614][ C11] optprobe_template_end+0xb8/0xc0 // should be __secondary_switched
ok, I will fix it in the next version.
+optprobe_template_end:
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 62b5b08d809d..45287e185ce3 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -425,6 +425,24 @@ void opt_pre_handler(struct kprobe *p, struct pt_regs *regs)
}
NOKPROBE_SYMBOL(opt_pre_handler);
+void optprobe_optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
+{
+ if (kprobe_disabled(&op->kp))
+ return;
+
+ guard(preempt)();
+
+ if (kprobe_running()) {
+ kprobes_inc_nmissed_count(&op->kp);
+ } else {
+ __this_cpu_write(current_kprobe, &op->kp);
+ get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE;
+ opt_pre_handler(&op->kp, regs);
+ __this_cpu_write(current_kprobe, NULL);
+ }
+}
+NOKPROBE_SYMBOL(optprobe_optimized_callback)
+
should we make this function arch specific rather than in the framework? it's only
used in the arm64 code in this patch currently.

Thanks.

ok, I will convert it to a private property for arm64.

Thank you.

/* Free optimized instructions and optimized_kprobe */
static void free_aggr_kprobe(struct kprobe *p)
{