Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack

From: Andrii Nakryiko
Date: Wed Sep 11 2024 - 16:36:35 EST


On Tue, Sep 10, 2024 at 8:07 PM Liao, Chang <liaochang1@xxxxxxxxxx> wrote:
>
>
>
> On 2024/9/11 4:54, Andrii Nakryiko wrote:
> > On Mon, Sep 9, 2024 at 11:14 PM Liao Chang <liaochang1@xxxxxxxxxx> wrote:
> >>
> >> This patch is the second part of a series to improve the selftest bench
> >> of uprobe/uretprobe [0]. The lack of simulation for 'stp fp, lr, [sp, #imm]'
> >> significantly impacts uprobe/uretprobe performance at function entry in
> >> most use cases. The profiling results below show that single-stepping the
> >> STP in the XOL slot and trapping back to the kernel reduces Redis RPS and
> >> noticeably increases the run time of a string grep.
> >>
> >> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >>
> >> Redis GET (higher is better)
> >> ----------------------------
> >> No uprobe: 49149.71 RPS
> >> Single-stepped STP: 46750.82 RPS
> >> Emulated STP: 48981.19 RPS
> >>
> >> Redis SET (higher is better)
> >> ----------------------------
> >> No uprobe: 49761.14 RPS
> >> Single-stepped STP: 45255.01 RPS
> >> Emulated STP: 48619.21 RPS
> >>
> >> Grep (lower is better)
> >> ----------------------
> >> No uprobe: 2.165s
> >> Single-stepped STP: 15.314s
> >> Emulated STP: 2.216s
> >>
> >> Additionally, profiling the entry instruction of all leaf and non-leaf
> >> functions shows that 'stp fp, lr, [sp, #imm]' accounts for more than
> >> 50% of them. So simulating the STP at function entry is a more viable
> >> option for uprobe.
> >>
> >> The first version [1] used a uaccess routine to simulate the STP
> >> that pushes fp/lr onto the stack, using two STTR instructions for the
> >> memory store. But as Mark pointed out, this approach can't simulate the
> >> correct single-copy atomicity and ordering properties of STP, especially
> >> when it interacts with MTE, POE, etc. So this patch uses a more complex
> >
> > Do all those effects matter if the thread is stopped after
> > breakpoint? This is pushing to stack, right? Other threads are not
> > supposed to access that memory anyways (not the well-defined ones, at
> > least, I suppose). Do we really need all these complications for
>
> I raised the same question in my reply to Mark. The STP simulation
> focuses on uprobe/uretprobe at function entry, which pushes two
> registers onto the *stack*, so I believe it might not need to match
> the exact properties of STP strictly. However, as you know, Mark

Agreed.

> stands by his comments about STP simulation, which is why I sent this
> patch out. Although the gain is not as good as with the uaccess version,
> it still offers better results than the current XOL code.
>
> > uprobes? We use a similar approach in x86-64, see emulate_push_stack()
> > in arch/x86/kernel/uprobes.c and it works great in practice (and has
>
> Yes, I've noticed the x86 routine. The CPU-specific difference lies in
> Arm64 CPUs with PAN enabled. For security reasons, the kernel cannot
> use STP (store a pair of registers to memory) on a userspace address.
> As a result, the kernel has to use the STTR instruction (store a single
> register to unprivileged memory) twice, which can't meet the atomicity
> and ordering properties of the original STP in userspace. In the future,
> if Arm64 adds an instruction for storing a pair of registers to
> unprivileged memory, it ought to replace this inefficient approach.
>
> > been for years by now). Would be nice to keep things simple knowing
> > that this is specifically for this rather well-defined and restricted
> > uprobe/uretprobe use case.
> >
> > Sorry, I can't help reviewing this, but I have a hunch that we might
> > be over-killing it with this approach, no?
>
> This approach indeed fails to obtain the maximum benefit from simulation.
>

Yes, the performance hit is very large for seemingly no good reason,
which is why I'm asking.
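
To make the simpler alternative concrete, here is a minimal userspace
sketch of what emulating a push of fp/lr amounts to: compute the target
stack address, store the two registers, and write back sp. It is a plain
C model, not kernel code. struct model_regs, emulate_stp_fp_lr() and the
byte-array stack are hypothetical names made up for illustration, the
sketch assumes the pre-indexed (writeback) form of the instruction, and
it deliberately ignores the atomicity, ordering, MTE and POE aspects
discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register file: only the fields this sketch needs. */
struct model_regs {
	uint64_t fp; /* x29 */
	uint64_t lr; /* x30 */
	uint64_t sp; /* modelled as a byte offset into the stack buffer */
};

/*
 * Model of a pre-indexed 'stp fp, lr, [sp, #imm]!':
 * store fp at sp + imm, lr at sp + imm + 8, then set sp = sp + imm.
 * Returns 0 on success, -1 if the 16-byte store would leave the buffer.
 */
static int emulate_stp_fp_lr(struct model_regs *regs, uint8_t *stack,
			     size_t stack_size, int64_t imm)
{
	uint64_t addr;

	assert(regs != NULL && stack != NULL);

	/* Unsigned wrap-around makes a negative imm act as a subtraction. */
	addr = regs->sp + (uint64_t)imm;
	if (addr > stack_size || stack_size - addr < 16)
		return -1; /* nothing is stored, sp is left untouched */

	memcpy(stack + addr, &regs->fp, sizeof(regs->fp));
	memcpy(stack + addr + 8, &regs->lr, sizeof(regs->lr));
	regs->sp = addr; /* writeback */
	return 0;
}
```

With sp at offset 64 and imm = -16, the call stores fp at offset 48, lr
at offset 56, and leaves sp at 48. An out-of-range store returns -1
without touching the stack or sp.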

And this performance concern is not just pure microbenchmarking. We do
have use cases with millions of uprobe calls per second. E.g., tracing
every single Python function call, then rolling a die (in a BPF program)
and sampling some portion of them (more heavy-weight logic). As such,
it's critical to be able to trigger a uprobe as fast as possible, because
most of the time we do nothing. So any overheads like this one are very
noticeable and limit possible applications.
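
The sampling pattern described above can be sketched roughly as follows.
This is a plain userspace C model, not an actual BPF program: struct
sampler, on_probe_hit() and the xorshift64 generator are hypothetical
stand-ins for the per-hit handler and its random source. The point is
that the common path is a counter bump and one cheap random check, so
the fixed cost of entering the probe dominates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-probe state: PRNG state plus two counters. */
struct sampler {
	uint64_t state;   /* xorshift64 state, must be non-zero */
	uint64_t hits;    /* total probe hits seen */
	uint64_t sampled; /* hits selected for the heavy-weight path */
};

/* Standard xorshift64 step, standing in for the probe's random source. */
static uint64_t xorshift64(uint64_t *s)
{
	uint64_t x = *s;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	*s = x;
	return x;
}

/*
 * Called on every probe hit. Returns 1 when this hit is selected
 * (roughly one in 'one_in'), 0 when it is skipped on the fast path.
 */
static int on_probe_hit(struct sampler *s, uint32_t one_in)
{
	assert(s != NULL && s->state != 0 && one_in != 0);

	s->hits++;
	if (xorshift64(&s->state) % one_in != 0)
		return 0; /* fast path: nothing else to do */

	s->sampled++; /* the heavy-weight logic would run here */
	return 1;
}
```

With one_in set to 1 every hit takes the heavy path. With a larger value
most hits return immediately, which is the case where per-hit overhead
matters most.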

> >
> >
> >> and inefficient approach that acquires the user stack pages, maps them
> >> into the kernel address space, and lets the kernel use STP directly to
> >> push fp/lr into the stack pages.
> >>
> >> xol-stp
> >> -------
> >> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu)
> >> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu)
> >> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu)
> >> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu)
> >> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu)
> >> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu)
> >>
> >> simulated-stp
> >> -------------
> >> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu)
> >> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu)
> >> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu)
> >> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu)
> >> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu)
> >> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu)
> >>
> >> The profiling results, based on the upstream kernel with the spinlock
> >> optimization patches [2], show that simulating STP increases
> >> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.128M/s/cpu) and
> >> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
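
For anyone who wants to recompute relative gains like these from the
benchmark tables, a small helper along the following lines should do.
relative_gain_pct() is a hypothetical name used only for this sketch. It
takes the xol-stp throughput as the baseline and the simulated-stp
throughput as the new value.

```c
#include <assert.h>

/*
 * Relative throughput change in percent, going from 'before' to 'after'.
 * Both values must be in the same unit (e.g. M/s/cpu).
 */
static double relative_gain_pct(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

For example, relative_gain_pct(0.616, 0.714) covers the uretprobe-push
rows of the two tables.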
> >>
> >> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@xxxxxxxxxxxxxx/
> >> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> >> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@xxxxxxxxxx/
> >>
> >> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
> >> ---
> >> arch/arm64/include/asm/insn.h | 1 +
> >> arch/arm64/kernel/probes/decode-insn.c | 16 +++++
> >> arch/arm64/kernel/probes/decode-insn.h | 1 +
> >> arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
> >> arch/arm64/kernel/probes/simulate-insn.h | 1 +
> >> arch/arm64/kernel/probes/uprobes.c | 21 ++++++
> >> arch/arm64/lib/insn.c | 5 ++
> >> 7 files changed, 134 insertions(+)
> >>
> >
> > [...]
> >
> >
>
> --
> BR
> Liao, Chang