Re: [PATCH bpf-next] bpf,x86: do RSB balance for trampoline

From: Menglong Dong
Date: Mon Nov 10 2025 - 06:44:32 EST


On 2025/11/6 10:56, Alexei Starovoitov wrote:
> On Wed, Nov 5, 2025 at 6:49 PM Menglong Dong <menglong.dong@xxxxxxxxx> wrote:
> >
> > On 2025/11/6 09:40, Menglong Dong wrote:
> > > On 2025/11/6 07:31, Alexei Starovoitov wrote:
> > > > On Tue, Nov 4, 2025 at 11:47 PM Menglong Dong <menglong.dong@xxxxxxxxx> wrote:
[......]
> > > >
> > > > Here another idea...
> > > > hack tr->func.ftrace_managed = false temporarily
> > > > and use BPF_MOD_JUMP in bpf_arch_text_poke()
> > > > when installing trampoline with fexit progs.
> > > > and also do:
> > > > @@ -3437,10 +3437,6 @@ static int __arch_prepare_bpf_trampoline(struct
> > > > bpf_tramp_image *im, void *rw_im
> > > >
> > > > emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off);
> > > > EMIT1(0xC9); /* leave */
> > > > - if (flags & BPF_TRAMP_F_SKIP_FRAME) {
> > > > - /* skip our return address and return to parent */
> > > > - EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */
> > > > - }
> > > > emit_return(&prog, image + (prog - (u8 *)rw_image));
> > > >
> > > > Then RSB is perfectly matched without messing up the stack
> > > > and/or extra calls.
> > > > If it works and performance is good the next step is to
> > > > teach ftrace to emit jmp or call in *_ftrace_direct()
> >
> > After the modification, the performance of fexit increase from
> > 76M/s to 137M/s, awesome!
>
> Nice! much better than double 'ret' :)
> _ftrace_direct() next?

Hi all,

Do you think it is worth implementing livepatch on top of the
bpf trampoline by introducing CONFIG_LIVEPATCH_BPF?
It's easy to do. I have a POC for it, and the performance
of livepatch increases from 99M/s to 200M/s according to
my benchmark.

The results above were measured with return thunk disabled. With
return thunk enabled, the performance decreases from 58M/s to
52M/s. The main performance improvement comes from the RSB,
and the return thunk always breaks the RSB, so there is no
improvement in that case. The per-cpu-ref get and put calls make
the bpf trampoline based livepatch perform worse than the
ftrace based one.

Thanks!
Menglong Dong
