Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset

From: Steven Rostedt
Date: Wed Mar 05 2025 - 10:06:02 EST

In reply to: Menglong Dong: "Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset"
Next in thread: Menglong Dong: "Re: [PATCH v4 1/4] x86/ibt: factor out cfi and fineibt offset"

On Wed, 5 Mar 2025 09:19:09 +0800
Menglong Dong <menglong8.dong@xxxxxxxxx> wrote:

> Ok, let me explain it from the beginning. (My English is not good,
> but I'll try to describe it as clearly as possible :/)

I always appreciate those who struggle with English having these
conversations. Thank you for that, as I know I am horrible at speaking any
other language. (I can get by in German, but even Germans tell me to switch
back to English ;-)

>
> Many BPF program types need to depend on the BPF trampoline,
> such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
> BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
> the kernel (or bpf) function and BPF program, and it acts just like the
> trampoline that ftrace uses.
>
> Generally speaking, it is used to hook a function, just as ftrace
> does:
>
> foo:
> endbr
> nop5 --> call trampoline_foo
> xxxx
>
> In short, the trampoline_foo can be this:
>
> trampoline_foo:
> prepare an array and store the args of foo to the array
> call fentry_bpf1
> call fentry_bpf2
> ......
> call foo+4 (origin call)

Note, I brought up this issue (calling the original function from the
trampoline) when I first heard about how BPF does this. I said this would
cause issues, and is only good for a few functions. Once you start doing
this for 1000s of functions, it's going to be a nightmare.

Looks like you are now in the nightmare phase.

My argument was that once you have this case, you need to switch over to the
kretprobe / function graph way of doing things, which is to have a shadow
stack and hijack the return address. Yes, that has slightly more overhead,
but it's better than having to add all these hacks.

And function graph has been updated so that it can do this for other users.
fprobes uses it now, and bpf can too.
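
A minimal userspace sketch of the shadow-stack idea mentioned above, not the
kernel's implementation. All names here (shadow_push, shadow_pop,
RETURN_HOOK, the struct layouts) are hypothetical and only model the
mechanism: on entry the real return address is saved and the return slot is
pointed at a hook; on exit the hook pops the saved address to resume the
caller.

```c
#include <assert.h>
#include <stddef.h>

#define SHADOW_DEPTH 64
/* Stand-in for the address of the return handler (assumption). */
#define RETURN_HOOK 0xdeadUL

struct shadow_entry {
	unsigned long orig_ret;	/* real return address saved at entry */
	unsigned long func;	/* address of the traced function */
};

struct shadow_stack {
	struct shadow_entry entries[SHADOW_DEPTH];
	int top;
};

/*
 * Function entry: save the real return address on the shadow stack and
 * redirect the return slot to the hook. Returns -1 when the shadow stack
 * is full, in which case the call is simply left untraced.
 */
static int shadow_push(struct shadow_stack *ss, unsigned long func,
		       unsigned long *ret_slot)
{
	if (ss->top >= SHADOW_DEPTH)
		return -1;
	ss->entries[ss->top].orig_ret = *ret_slot;
	ss->entries[ss->top].func = func;
	ss->top++;
	*ret_slot = RETURN_HOOK;
	return 0;
}

/*
 * Return hook: pop the matching entry, report which function is returning,
 * and hand back the real return address to jump to.
 */
static unsigned long shadow_pop(struct shadow_stack *ss, unsigned long *func)
{
	assert(ss->top > 0);
	ss->top--;
	*func = ss->entries[ss->top].func;
	return ss->entries[ss->top].orig_ret;
}
```

The point of the model is that nothing here is specific to one traced
function: the same push/pop pair serves every function, so no per-function
trampoline has to be generated to see the function's exit.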

> save the return value of foo
> call fexit_bpf1 (this bpf can get the return value of foo)
> call fexit_bpf2
> .......
> return to the caller of foo
>
> We can see that trampoline_foo can only be used for
> the function foo, as different kernel functions can have
> different BPF programs attached, different argument counts,
> etc. Therefore, we have to create 1000 BPF trampolines if
> we want to attach a BPF program to 1000 kernel functions.
>
> The creation of the BPF trampoline is expensive. According to
> my testing, it takes more than 1 second to create 100 BPF
> trampolines. What's more, it consumes more memory.
>
> If we have per-function metadata support, then we can
> create a global BPF trampoline, like this:
>
> trampoline_global:
> prepare an array and store the args of foo to the array
> get the metadata by the ip
> call metadata.fentry_bpf1
> call metadata.fentry_bpf2
> ....
> call foo+4 (origin call)

So if this is a global trampoline, wouldn't this "call foo" need to be an
indirect call? It can't be a direct call, otherwise you need a separate
trampoline for that.

This means you need to mitigate for Spectre here, and you just lost the
performance gain from not using function graph.
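
To make the direct vs. indirect distinction concrete, here is a small
userspace sketch. It is only a model: struct func_meta, trampoline_foo(),
trampoline_global() and the hook are hypothetical names, and the metadata
lookup by ip is replaced by passing the metadata pointer in directly.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*traced_fn_t)(long arg);
typedef void (*hook_fn_t)(long arg);

/* Hypothetical per-function metadata. */
struct func_meta {
	traced_fn_t orig;	/* the original function to call */
	hook_fn_t fentry;	/* attached entry hook, or NULL */
};

static long foo(long x) { return x + 1; }
static long bar(long x) { return x * 2; }

static long hook_calls;
static void count_hook(long arg) { (void)arg; hook_calls++; }

/*
 * Per-function trampoline: the target is known when the trampoline is
 * generated, so the original function is reached with a direct call.
 */
static long trampoline_foo(long arg)
{
	count_hook(arg);
	return foo(arg);		/* direct call */
}

/*
 * Global trampoline: one body serves every function, so the target has to
 * come from the metadata and the call becomes indirect. In the kernel this
 * is the call that would need a Spectre mitigation such as a retpoline.
 */
static long trampoline_global(const struct func_meta *meta, long arg)
{
	assert(meta && meta->orig);
	if (meta->fentry)
		meta->fentry(arg);
	return meta->orig(arg);		/* indirect call */
}
```

With this shape, calling trampoline_global() with metadata for foo or for
bar reuses the same code, at the cost of the function-pointer call on every
invocation.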

> save the return value of foo
> call metadata.fexit_bpf1 (this bpf can get the return value of foo)
> call metadata.fexit_bpf2
> .......
> return to the caller of foo
>
> (The metadata holds more information for the global trampoline than
> I described.)
>
> Then, we don't need to create a trampoline for every kernel function
> anymore.
>
> Another beneficiary can be ftrace. For now, all the kernel functions that
> are enabled by dynamic ftrace will be added to a filter hash if there is
> more than one callback. A hash lookup then happens whenever the traced
> functions are called, which has an impact on performance, see
> __ftrace_ops_list_func() -> ftrace_ops_test(). With per-function
> metadata support, we can store whether a callback is enabled on a given
> kernel function in the metadata, which can make the performance
> much better.

Let me say now that ftrace will not use this. Looks like too much work for
little gain. The only time this impacts ftrace is when there are two
different callbacks tracing the same function, and it only impacts that
function. All other functions being traced still call the appropriate
trampoline for the callback.
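
A rough userspace model of that behaviour, for illustration only. The names
(struct ops, ops_test(), ops_list_func(), dispatch()) are hypothetical and
are not the ftrace API; the linear scan stands in for the filter hash
lookup, and the choice made in dispatch() is something the kernel would
settle when callbacks are registered, not on every call.

```c
#include <assert.h>
#include <stddef.h>

struct ops {
	const unsigned long *filter;	/* traced function addresses */
	size_t nr;			/* number of entries in filter */
	void (*func)(unsigned long ip);	/* the callback */
	struct ops *next;
};

static int calls_a, calls_b;
static void cb_a(unsigned long ip) { (void)ip; calls_a++; }
static void cb_b(unsigned long ip) { (void)ip; calls_b++; }

/* Does this callback trace ip? (stand-in for the hash lookup) */
static int ops_test(const struct ops *op, unsigned long ip)
{
	for (size_t i = 0; i < op->nr; i++)
		if (op->filter[i] == ip)
			return 1;
	return 0;
}

/* Shared path: walk every callback and test its filter on each call. */
static void ops_list_func(struct ops *head, unsigned long ip)
{
	for (struct ops *op = head; op; op = op->next) {
		assert(op->func);
		if (ops_test(op, ip))
			op->func(ip);
	}
}

/*
 * Returns 0 if ip went straight to its single callback, 1 if it had to
 * take the list path because several callbacks trace it, -1 if untraced.
 */
static int dispatch(struct ops *head, unsigned long ip)
{
	struct ops *only = NULL;
	int n = 0;

	for (struct ops *op = head; op; op = op->next) {
		if (ops_test(op, ip)) {
			only = op;
			n++;
		}
	}
	if (n == 1) {
		only->func(ip);
		return 0;
	}
	if (n > 1) {
		ops_list_func(head, ip);
		return 1;
	}
	return -1;
}
```

In this model only the address traced by both callbacks pays for the list
walk; an address traced by a single callback goes straight to it.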

-- Steve

>
> The per-function metadata storage is a basic facility, and I think there
> may be other functions that can use it for better performance in the future
> too.
>
> (Hope that I'm describing it clearly :/)
