Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset

From: Menglong Dong
Date: Wed Mar 05 2025 - 22:00:24 EST


On Wed, Mar 5, 2025 at 11:02 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Wed, 5 Mar 2025 09:19:09 +0800
> Menglong Dong <menglong8.dong@xxxxxxxxx> wrote:
>
> > Ok, let me explain it from the beginning. (My English is not good,
> > but I'll try to describe it as clearly as possible :/)
>
> I always appreciate those who struggle with English having these
> conversations. Thank you for that, as I know I am horrible in speaking any
> other language. (I can get by in German, but even Germans tell me to switch
> back to English ;-)
>
> >
> > Many BPF program types need to depend on the BPF trampoline,
> > such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
> > BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
> > the kernel (or bpf) function and BPF program, and it acts just like the
> > trampoline that ftrace uses.
> >
> > Generally speaking, it is used to hook a function, just like what ftrace
> > does:
> >
> > foo:
> > endbr
> > nop5 --> call trampoline_foo
> > xxxx
> >
> > In short, the trampoline_foo can be this:
> >
> > trampoline_foo:
> > prepare an array and store the args of foo in the array
> > call fentry_bpf1
> > call fentry_bpf2
> > ......
> > call foo+4 (origin call)
>
> Note, I brought up this issue when I first heard about how BPF does this.
> The calling of the original function from the trampoline. I said this will
> cause issues, and is only good for a few functions. Once you start doing
> this for 1000s of functions, it's going to be a nightmare.
>
> Looks like you are now in the nightmare phase.
>
> My argument was once you have this case, you need to switch over to the
> kretprobe / function graph way of doing things, which is to have a shadow
> stack and hijack the return address. Yes, that has slightly more overhead,
> but it's better than having to add all these hacks.
>
> And function graph has been updated so that it can do this for other users.
> fprobes uses it now, and bpf can too.

Yeah, I heard that kretprobe is able to get the function
arguments too, which it gains from the function graph
infrastructure.

Besides the overhead, another problem is that we can't do
direct memory access in BPF programs based on kretprobe.

>
> > save the return value of foo
> > call fexit_bpf1 (this bpf can get the return value of foo)
> > call fexit_bpf2
> > .......
> > return to the caller of foo
> >
> > We can see that trampoline_foo can only be used for
> > the function foo, as different kernel functions can be attached
> > to different BPF programs, have different argument counts,
> > etc. Therefore, we have to create 1000 BPF trampolines if
> > we want to attach a BPF program to 1000 kernel functions.
> >
> > The creation of a BPF trampoline is expensive. According to
> > my testing, it takes more than 1 second to create 100 BPF
> > trampolines. What's more, they consume more memory.
> >
> > If we have per-function metadata support, then we can
> > create a global BPF trampoline, like this:
> >
> > trampoline_global:
> > prepare an array and store the args of foo in the array
> > get the metadata by the ip
> > call metadata.fentry_bpf1
> > call metadata.fentry_bpf2
> > ....
> > call foo+4 (origin call)
>
> So if this is a global trampoline, wouldn't this "call foo" need to be an
> indirect call? It can't be a direct call, otherwise you need a separate
> trampoline for that.
>
> This means you need to mitigate for spectre here, and you just lost the
> performance gain from not using function graph.

Yeah, you are right, this is an indirect call here. I haven't done
any research on spectre mitigation yet, but maybe we can
convert it into a direct call somehow? Such as, we maintain a
trampoline_table:
some preparation
jmp +%eax (eax is the index of the target function)
call foo1 + 4
return
call foo2 + 4
return
call foo3 + 4
return

(Hmm... is the jmp above also an indirect branch?)

And in trampoline_global, we can call it like this:

mov metadata.index, %eax
call trampoline_table

I'm not sure if it works. However, indirect calls are also used
in the function graph, so we should still get better performance,
shouldn't we?

Let me have a look at the code of the function graph first :/

Thanks!
Menglong Dong

>
>
> > save the return value of foo
> > call metadata.fexit_bpf1 (this bpf can get the return value of foo)
> > call metadata.fexit_bpf2
> > .......
> > return to the caller of foo
> >
> > (The metadata holds more information for the global trampoline than
> > I described.)
> >
> > Then, we don't need to create a trampoline for every kernel function
> > anymore.
> >
> > Another beneficiary can be ftrace. For now, all the kernel functions that
> > are enabled by dynamic ftrace will be added to a filter hash if there is
> > more than one callback. And a hash lookup will happen when the traced
> > functions are called, which has an impact on performance; see
> > __ftrace_ops_list_func() -> ftrace_ops_test(). With per-function
> > metadata support, we can store in the metadata whether the callback is
> > enabled for the kernel function, which can make the performance
> > much better.
>
> Let me say now that ftrace will not use this. It looks like too much work for
> little gain. The only time this impacts ftrace is when there are two
> different callbacks tracing the same function, and it only impacts that
> function. All other functions being traced still call the appropriate
> trampoline for their callback.
>
> -- Steve
>
> >
> > The per-function metadata storage is a basic facility, and I think there
> > may be other features that can use it for better performance in the future
> > too.
> >
> > (I hope I'm describing it clearly :/)
>