Re: [PATCH 1/2] uprobes: Optimize the return_instance related routines

From: Andrii Nakryiko
Date: Tue Jul 09 2024 - 19:55:55 EST


On Mon, Jul 8, 2024 at 6:00 PM Liao Chang <liaochang1@xxxxxxxxxx> wrote:
>
> Reduce the runtime overhead for struct return_instance data managed by
> uretprobe. This patch replaces the dynamic allocation with a statically
> allocated array, leveraging two facts: the limited nesting depth of
> uretprobe (max 64) and the function-call style of return_instance usage
> (create at entry, free at exit).
>
> This patch has been tested on Kunpeng916 (Hi1616), 4 NUMA nodes, 64
> cores @ 2.4GHz. Redis benchmarks show a throughput gain of about 2% for
> the Redis GET and SET commands:
>
> ------------------------------------------------------------------
> Test case       | No uretprobes | uretprobes     | uretprobes
>                 |               | (current)      | (optimized)
> ==================================================================
> Redis SET (RPS) | 47025         | 40619 (-13.6%) | 41529 (-11.6%)
> ------------------------------------------------------------------
> Redis GET (RPS) | 46715         | 41426 (-11.3%) | 42306 (-9.4%)
> ------------------------------------------------------------------
>
> Signed-off-by: Liao Chang <liaochang1@xxxxxxxxxx>
> ---
> include/linux/uprobes.h | 10 ++-
> kernel/events/uprobes.c | 162 ++++++++++++++++++++++++----------------
> 2 files changed, 105 insertions(+), 67 deletions(-)
>

[...]

> +static void cleanup_return_instances(struct uprobe_task *utask, bool chained,
> +				     struct pt_regs *regs)
> +{
> +	struct return_frame *frame = &utask->frame;
> +	struct return_instance *ri = frame->return_instance;
> +	enum rp_check ctx = chained ? RP_CHECK_CHAIN_CALL : RP_CHECK_CALL;
> +
> +	while (ri && !arch_uretprobe_is_alive(ri, ctx, regs)) {
> +		ri = next_ret_instance(frame, ri);
> +		utask->depth--;
> +	}
> +	frame->return_instance = ri;
> +}
> +
> +static struct return_instance *alloc_return_instance(struct uprobe_task *task)
> +{
> +	struct return_frame *frame = &task->frame;
> +
> +	if (!frame->vaddr) {
> +		frame->vaddr = kcalloc(MAX_URETPROBE_DEPTH,
> +				sizeof(struct return_instance), GFP_KERNEL);

Are you always pre-allocating MAX_URETPROBE_DEPTH instances?
I.e., even if we need just one (because there is no recursion), you'd
still spend memory on all 64?

That seems rather wasteful.

Have you considered using objpool for fast reuse across multiple CPUs?
Check lib/objpool.c.
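
To make the memory concern concrete, here is a minimal userspace sketch of a
return-instance stack that allocates on demand instead of reserving all 64
slots up front. This is neither objpool nor code from the patch; it only
illustrates that the common non-recursive case can get by with a single slot.
struct ri_sketch, struct ri_stack, ri_push(), ri_pop(), ri_free() and
MAX_DEPTH are hypothetical names, and the fields of ri_sketch are placeholders
rather than the real struct return_instance layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: mirrors the "max 64" nesting limit from the patch description. */
#define MAX_DEPTH 64

/* Hypothetical stand-in; the real struct return_instance differs. */
struct ri_sketch {
	unsigned long func;
	unsigned long stack;
	unsigned long orig_ret_vaddr;
};

struct ri_stack {
	struct ri_sketch *slots;	/* NULL until the first push */
	size_t cap;			/* slots currently allocated */
	size_t depth;			/* slots currently in use */
};

/*
 * Push one instance. Capacity doubles (1, 2, 4, ...) up to MAX_DEPTH, so a
 * task that never nests uretprobes pays for one slot, not 64.
 * Returns the new top slot, or NULL on depth limit or allocation failure.
 * Note: realloc() may move the array, so pointers to earlier slots must not
 * be kept across a push.
 */
static struct ri_sketch *ri_push(struct ri_stack *s)
{
	assert(s);
	if (s->depth >= MAX_DEPTH)
		return NULL;

	if (s->depth == s->cap) {
		size_t ncap = s->cap ? s->cap * 2 : 1;
		struct ri_sketch *n;

		if (ncap > MAX_DEPTH)
			ncap = MAX_DEPTH;
		n = realloc(s->slots, ncap * sizeof(*n));
		if (!n)
			return NULL;
		s->slots = n;
		s->cap = ncap;
	}
	return &s->slots[s->depth++];
}

/* Pop the top instance; the memory stays cached for reuse by the next push. */
static void ri_pop(struct ri_stack *s)
{
	assert(s && s->depth > 0);
	s->depth--;
}

/* Release everything, e.g. when the task exits. */
static void ri_free(struct ri_stack *s)
{
	free(s->slots);
	s->slots = NULL;
	s->cap = 0;
	s->depth = 0;
}
```

A push followed by a pop leaves one cached slot behind, so the create-at-entry,
free-at-exit pattern still avoids an allocation per probe hit after the first
one. A shared pool such as objpool would go further by reusing instances
across tasks and CPUs, which this per-task sketch does not attempt.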

> +		if (!frame->vaddr)
> +			return NULL;
> +	}
> +
> +	if (!frame->return_instance) {
> +		frame->return_instance = frame->vaddr;
> +		return frame->return_instance;
> +	}
> +
> +	return ++frame->return_instance;
> +}
> +
> +static inline bool return_frame_empty(struct uprobe_task *task)
> +{
> +	return !task->frame.return_instance;
> }
>
> /*

[...]