Re: [PATCH] arm64: insn: Simulate nop and push instruction for better uprobe performance
From: Liao, Chang
Date: Mon Sep 09 2024 - 03:43:03 EST
On 2024/9/6 17:39, Mark Rutland wrote:
> On Tue, Aug 27, 2024 at 07:33:55PM +0800, Liao, Chang wrote:
>> Hi, Mark
>>
>> Would you like to discuss this patch further, or do you still believe emulating
>> STP to push FP/LR into the stack in kernel is not a good idea?
>
> I'm happy with the NOP emulation in principle, so please send a new
> version with *just* the NOP emulation, and I can review that.
Mark, please see v2 at the link below:
https://lore.kernel.org/all/20240909071114.1150053-1-liaochang1@xxxxxxxxxx/
Thanks.
>
> Regarding STP emulation, I stand by my earlier comments, and in addition
> to those comments, AFAICT it's currently unsafe to use any uaccess
> routine in the uprobe BRK handler anyway, so that's moot. The uprobe BRK
> handler runs with preemption disabled and IRQs (and all other maskable
> exceptions) masked, and faults cannot be handled. IIUC
> CONFIG_DEBUG_ATOMIC_SLEEP should scream about that.
>
> Longer-term I'm happy to look into making it possible to do uaccesses in
> the uprobe handlers, but that requires a much wider rework of the way we
> handle BRK instructions, single-step, and hardware breakpoints and
> watchpoints.
>
> Mark.
>
>> Thanks.
>>
>>
>> On 2024/8/21 15:55, Liao, Chang wrote:
>>> Hi, Mark
>>>
>>> My bad for taking so long to reply. I generally agree with your suggestions
>>> regarding STP emulation.
>>>
>>> On 2024/8/15 17:58, Mark Rutland wrote:
>>>> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
>>>>> As Andrii pointed out, the uprobe/uretprobe selftest bench runs into a
>>>>> counterintuitive result: the nop and push variants are much slower than the
>>>>> ret variant [0]. The root cause lies in arch_probe_analyse_insn(), which
>>>>> excludes 'nop' and 'stp' from the list of emulatable instructions. This
>>>>> forces the kernel to return to userspace and execute them out-of-line, then
>>>>> trap back into the kernel to run the uprobe callback functions. This leads
>>>>> to significant performance overhead compared to the 'ret' variant, which is
>>>>> already emulated.
>>>>
>>>> I appreciate this might be surprising, but does it actually matter
>>>> outside of a microbenchmark?
>>>
>>> I did a simple comparison of the performance impact of single-stepped versus
>>> emulated STP on my local machine. Three use cases were measured: Redis GET and
>>> SET throughput (Requests Per Second, RPS), and the time taken to grep for the
>>> "arch_uprobe_copy_xol" string in the kernel source.
>>>
>>> Redis GET (higher is better)
>>> ----------------------------
>>> No uprobe: 49149.71 RPS
>>> Single-stepped STP: 46750.82 RPS
>>> Emulated STP: 48981.19 RPS
>>>
>>> Redis SET (higher is better)
>>> ----------------------------
>>> No uprobe: 49761.14 RPS
>>> Single-stepped STP: 45255.01 RPS
>>> Emulated STP: 48619.21 RPS
>>>
>>> Grep (lower is better)
>>> ----------------------
>>> No uprobe: 2.165s
>>> Single-stepped STP: 15.314s
>>> Emulated STP: 2.216s
>>>
>>> The results show that single-stepping the STP instruction used to push fp/lr
>>> onto the stack significantly impacts Redis and grep performance, leading to a
>>> notable drop in RPS and increase in runtime respectively. So emulating STP at
>>> function entry might be a more viable option for uprobes.
>>>
>>>>
>>>>> Typically a uprobe is installed on a 'nop' for USDT, or at function entry,
>>>>> which starts with the instruction 'stp x29, x30, [sp, #imm]!' to push lr
>>>>> and fp onto the stack, regardless of kernel or userspace binary.
>>>>
>>>> Function entry doesn't always start with a STP; these days it's often a
>>>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
>>>> the compiler), it could be any arbitrary instruction. This might happen
>>>> to be the common case today, but there are certainly codebases where it
>>>> is not.
>>>
>>> Sure, if the kernel, CPU and compiler support BTI and PAC, the entry instruction
>>> is definitely not STP. But for CPUs and kernels lacking this support, STP as
>>> the entry instruction is still the common case. I also profiled the entry
>>> instruction of all leaf and non-leaf functions: the ratio of STP is 64.5%
>>> for redis and 76.1% for the BPF selftest bench. So I think it is still
>>> useful to emulate STP at function entry. Perhaps, for CPUs and kernels
>>> with BTI and PAC enabled, uprobes could choose the slower single-stepping to
>>> execute the STP that pushes onto the stack.
>>>
>>>>
>>>> STP (or any instruction that accesses memory) is fairly painful to
>>>> emulate because you need to ensure that the correct atomicity and
>>>> ordering properties are provided (e.g. an aligned STP should be
>>>> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
>>>> chance), and that the correct VMSA behaviour is provided (e.g. when
>>>> interacting with MTE, POE, etc, while the uaccess primitives don't try
>>>> to be 100% equivalent to instructions in userspace).
>>> Agreed, but I don't think it has to strictly emulate the single-copy-atomic
>>> property of the STP that is used to push fp/lr onto the stack. In most cases,
>>> only one CPU will push registers to a given position on the stack, and I can't
>>> see why other CPUs would depend on the ordering of data pushed onto the stack.
>>> So atomicity and ordering are not that important for this scenario. Regarding
>>> MTE and POE, a strategy similar to BTI and PAC can be applied: for CPUs and
>>> kernels with MTE and POE enabled, uprobes choose the slower single-stepping to
>>> execute the STP that pushes onto the stack.
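>>>
>>> For example, the decode path could conservatively gate the emulation on those
>>> features, along these lines (just a sketch of the idea, untested, assuming the
>>> existing system_supports_mte() helper and the STP decoder from this patch):
>>>
>>> 	/*
>>> 	 * Fall back to single-stepping when MTE is present (a similar
>>> 	 * check would be needed for POE), since copy_to_user() does not
>>> 	 * reproduce their behaviour exactly.
>>> 	 */
>>> 	if (!kernel && !system_supports_mte() &&
>>> 	    aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>>> 		api->handler = simulate_stp_fp_lr_sp_64b;
>>> 		return INSN_GOOD_NO_SLOT;
>>> 	}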
>>>
>>>>
>>>> For those reasons, in general I don't think we should be emulating any
>>>> instruction which accesses memory, and we should not try to emulate the
>>>> STP, but I think it's entirely reasonable to emulate NOP.
>>>>
>>>>> In order to
>>>>> improve the performance of handling uprobes for common use cases, this
>>>>> patch supports emulation of the arm64 equivalents of the 'nop' and
>>>>> 'push' instructions. The benchmark results below indicate that the
>>>>> performance gain from emulation is obvious.
>>>>>
>>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>>>
>>>>> xol (1 cpus)
>>>>> ------------
>>>>> uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod)
>>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>> uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod)
>>>>> uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod)
>>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>> uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod)
>>>>>
>>>>> emulation (1 cpus)
>>>>> -------------------
>>>>> uprobe-nop: 1.862 ± 0.002M/s (1.862M/prod)
>>>>> uprobe-push: 1.743 ± 0.006M/s (1.743M/prod)
>>>>> uprobe-ret: 1.840 ± 0.001M/s (1.840M/prod)
>>>>> uretprobe-nop: 0.964 ± 0.004M/s (0.964M/prod)
>>>>> uretprobe-push: 0.936 ± 0.004M/s (0.936M/prod)
>>>>> uretprobe-ret: 0.940 ± 0.001M/s (0.940M/prod)
>>>>>
>>>>> As shown above, the performance gap between the 'nop/push' and 'ret'
>>>>> variants has been significantly reduced. Because the emulation of the 'push'
>>>>> instruction needs to access userspace memory, it spends more cycles than
>>>>> the others.
>>>>>
>>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@xxxxxxxxxxxxxx/
>>>>>
>>>>> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
>>>>> ---
>>>>> arch/arm64/include/asm/insn.h | 21 ++++++++++++++++++
>>>>> arch/arm64/kernel/probes/decode-insn.c | 18 +++++++++++++--
>>>>> arch/arm64/kernel/probes/decode-insn.h | 3 ++-
>>>>> arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
>>>>> arch/arm64/kernel/probes/simulate-insn.h | 2 ++
>>>>> arch/arm64/kernel/probes/uprobes.c | 2 +-
>>>>> 6 files changed, 70 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>>> index 8c0a36f72d6f..a246e6e550ba 100644
>>>>> --- a/arch/arm64/include/asm/insn.h
>>>>> +++ b/arch/arm64/include/asm/insn.h
>>>>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>>>>> aarch64_insn_is_prfm_lit(insn);
>>>>> }
>>>>>
>>>>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
>>>>> +{
>>>>> + /* nop */
>>>>> + return aarch64_insn_is_hint(insn) &&
>>>>> + ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
>>>>> +}
>>>>
>>>> This looks fine, but the comment can go.
>>>
>>> Removed.
>>>
>>>>
>>>>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
>>>>> +{
>>>>> + /*
>>>>> + * The first instruction at function entry often follows the
>>>>> + * pattern 'stp x29, x30, [sp, #imm]!', which pushes fp and lr
>>>>> + * onto the stack.
>>>>> + */
>>>>> + return aarch64_insn_is_stp_pre(insn) &&
>>>>> + (((insn >> 30) & 0x03) == 2) && /* opc == 10 */
>>>>> + (((insn >> 5) & 0x1F) == 31) && /* Rn is sp */
>>>>> + (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x30 */
>>>>> + (((insn >> 0) & 0x1F) == 29); /* Rt is x29 */
>>>>> +}
>>>>
>>>> We have accessors for these fields. Please use them.
>>>
>>> Do you mean aarch64_insn_decode_register()?
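>>>
>>> If so, a rough, untested sketch using those accessors (assuming the
>>> AARCH64_INSN_REGTYPE_* and AARCH64_INSN_REG_* definitions in <asm/insn.h>)
>>> might look like:
>>>
>>> static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
>>> {
>>> 	/* 64-bit pre-index STP with Rn == sp, Rt == x29 (fp), Rt2 == x30 (lr) */
>>> 	return aarch64_insn_is_stp_pre(insn) &&
>>> 	       (((insn >> 30) & 0x03) == 2) && /* opc == 10 */
>>> 	       aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RN, insn) == AARCH64_INSN_REG_SP &&
>>> 	       aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT2, insn) == AARCH64_INSN_REG_LR &&
>>> 	       aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn) == AARCH64_INSN_REG_FP;
>>> }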
>>>
>>>>
>>>> Regardless, as above I do not think we should have a helper this
>>>> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
>>>
>>> Leaving the necessity of STP emulation aside, where would be the best file to
>>> place such hard-coded decoder helpers?
>>>
>>>>
>>>>> enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>>>>> u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>>>>> u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>>>>> index 968d5fffe233..df7ca16fc763 100644
>>>>> --- a/arch/arm64/kernel/probes/decode-insn.c
>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>>>>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>>>> * INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
>>>>> */
>>>>> enum probe_insn __kprobes
>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
>>>>> + bool kernel)
>>>>> {
>>>>> + /*
>>>>> + * While the 'nop' and 'stp x29, x30, [sp, #imm]!' instructions can
>>>>> + * execute in the out-of-line slot, simulating them in breakpoint
>>>>> + * handling offers better performance.
>>>>> + */
>>>>> + if (aarch64_insn_is_nop(insn)) {
>>>>> + api->handler = simulate_nop;
>>>>> + return INSN_GOOD_NO_SLOT;
>>>>> + } else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>>>>> + api->handler = simulate_stp_fp_lr_sp_64b;
>>>>> + return INSN_GOOD_NO_SLOT;
>>>>> + }
>>>>
>>>> With the STP emulation gone, you won't need the kernel parameter here.
>>>>>
>>>>> +
>>>>> /*
>>>>> * Instructions reading or modifying the PC won't work from the XOL
>>>>> * slot.
>>>>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
>>>>> else
>>>>> scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
>>>>> }
>>>>> - decoded = arm_probe_decode_insn(insn, &asi->api);
>>>>> + decoded = arm_probe_decode_insn(insn, &asi->api, true);
>>>>>
>>>>> if (decoded != INSN_REJECTED && scan_end)
>>>>> if (is_probed_address_atomic(addr - 1, scan_end))
>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
>>>>> index 8b758c5a2062..ec4607189933 100644
>>>>> --- a/arch/arm64/kernel/probes/decode-insn.h
>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.h
>>>>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
>>>>> arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>>>>> #endif
>>>>> enum probe_insn __kprobes
>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
>>>>> + bool kernel);
>>>>>
>>>>> #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
>>>>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>>>>> index 22d0b3252476..0b1623fa7003 100644
>>>>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>>>>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>>>>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>>>>>
>>>>> instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>> }
>>>>> +
>>>>> +void __kprobes
>>>>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>>> +{
>>>>> + instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>> +}
>>>>
>>>> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
>>>> and that's an extant bug in arch_uprobe_post_xol(). This can be:
>>>
>>> For emulated instructions, uprobe won't enable the CPU's single-step mode;
>>> please see handle_swbp() in kernel/events/uprobes.c:
>>>
>>> if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>>> goto out;
>>>
>>> if (!pre_ssout(uprobe, regs, bp_vaddr))
>>> return;
>>>
>>> For emulated instructions, the entire single-stepping path and its associated
>>> exception are skipped, which typically begins with pre_ssout() and ends with
>>> arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
>>> NOP is generally not a bad idea.
>>>
>>>>
>>>> | void __kprobes
>>>> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>> | {
>>>> | arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>>>> | }
>>>>
>>>>> +
>>>>> +void __kprobes
>>>>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
>>>>> +{
>>>>> + long imm7;
>>>>> + u64 buf[2];
>>>>> + long new_sp;
>>>>> +
>>>>> + imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
>>>>> + new_sp = regs->sp + (imm7 << 3);
>>>>
>>>> We have accessors for these fields, please use them.
>>>
>>> Do you mean aarch64_insn_decode_immediate()?
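>>>
>>> If so, the immediate decode could become something like the following
>>> (untested; assuming AARCH64_INSN_IMM_7 maps onto the STP imm7 field):
>>>
>>> 	/* imm7 is sign-extended and scaled by 8 for the 64-bit variant */
>>> 	imm7 = sign_extend64(aarch64_insn_decode_immediate(AARCH64_INSN_IMM_7, opcode), 6);
>>> 	new_sp = regs->sp + (imm7 << 3);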
>>>
>>>>
>>>>> +
>>>>> + buf[0] = regs->regs[29];
>>>>> + buf[1] = regs->regs[30];
>>>>> +
>>>>> + if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
>>>>> + force_sig(SIGSEGV);
>>>>> + return;
>>>>> + }
>>>>
>>>> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
>>>> same way as an STP in userspace, and this will not have the same
>>>> atomicity properties as an STP.
>>>>>
>>>>> +
>>>>> + regs->sp = new_sp;
>>>>> + instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>
>>>> Likewise, this would need to use arm64_skip_faulting_instruction(),
>>>> though as above I think we should drop STP emulation entirely.
>>>
>>> I explained the reason for using instruction_pointer_set() under your comments
>>> on simulate_nop().
>>>
>>> Thanks.
>>>
>>>>
>>>> Mark.
>>>>
>>>
>>
>> --
>> BR
>> Liao, Chang
>
--
BR
Liao, Chang