Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain

From: Tao Chen

Date: Fri Feb 06 2026 - 04:23:15 EST


On 2026/2/6 01:34, Andrii Nakryiko wrote:
On Wed, Feb 4, 2026 at 10:16 PM Tao Chen <chen.dylane@xxxxxxxxx> wrote:

On 2026/2/4 09:08, Andrii Nakryiko wrote:
On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@xxxxxxxxx> wrote:

From the BPF stack map side, we want to ensure that the callchain buffer
will not be overwritten by other preempting tasks, while also keeping the
preempt-disabled interval short. Based on the suggestions from Peter and
Andrii, export a new API, __get_perf_callchain; its intended usage from
the BPF side is as follows:

preempt_disable()
entry = get_callchain_entry()
preempt_enable()
__get_perf_callchain(entry)
put_callchain_entry(entry)
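
Purely for illustration, that pattern might look roughly like this on the
BPF side; the argument list of __get_perf_callchain() below is an
assumption, not the signature this patch actually introduces:

static int bpf_stack_sketch(struct pt_regs *regs, u64 *buf, u32 max_stack)
{
	struct perf_callchain_entry *entry;
	int rctx;

	/* grab the per-CPU callchain buffer; keep the preempt-off window short */
	preempt_disable();
	entry = get_callchain_entry(&rctx);
	preempt_enable();

	if (unlikely(!entry))
		return -EBUSY;

	/* unwind kernel + user stacks into the pre-acquired entry (args assumed) */
	__get_perf_callchain(entry, regs, true, true, max_stack, false, false);

	/* copy the ips out before releasing the buffer */
	memcpy(buf, entry->ip, entry->nr * sizeof(u64));

	put_callchain_entry(rctx);
	return 0;
}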

Suggested-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
Signed-off-by: Tao Chen <chen.dylane@xxxxxxxxx>
---
include/linux/perf_event.h | 5 +++++
kernel/events/callchain.c | 34 ++++++++++++++++++++++------------
2 files changed, 27 insertions(+), 12 deletions(-)


Looking at the whole __bpf_get_stack() logic again, why didn't we just
do something like this:

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index da3d328f5c15..80364561611c 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -460,8 +460,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,

max_depth = stack_map_calculate_max_depth(size, elem_size, flags);

- if (may_fault)
- rcu_read_lock(); /* need RCU for perf's callchain below */
+ if (!trace_in)
+ preempt_disable();

if (trace_in) {
trace = trace_in;
@@ -474,8 +474,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
}

if (unlikely(!trace) || trace->nr < skip) {
- if (may_fault)
- rcu_read_unlock();
+ if (!trace_in)
+ preempt_enable();
goto err_fault;
}

@@ -494,8 +494,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
}

/* trace/ips should not be dereferenced after this point */
- if (may_fault)
- rcu_read_unlock();
+ if (!trace_in)
+ preempt_enable();

if (user_build_id)
stack_map_get_build_id_offset(buf, trace_nr, user, may_fault);


?

Build ID parsing happens after we've copied data from perf's
callchain_entry into user-provided memory, so raw callchain retrieval
can be done with preemption disabled, as it's supposed to be brief. The
build ID parsing part, which indeed might fault and be much slower, will
be done well after that (we even have a comment there saying that
trace/ips should not be touched).
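
Put differently, the ordering being suggested is roughly the following
(abbreviated sketch; the raw-ip vs. build_id copy details are omitted,
this is not the literal stackmap code):

	if (!trace_in)
		preempt_disable();

	trace = get_perf_callchain(regs, kernel, user, max_depth,
				   crosstask, false);
	if (unlikely(!trace) || trace->nr < skip) {
		if (!trace_in)
			preempt_enable();
		goto err_fault;
	}

	/* copy data out of perf's per-CPU buffer while still non-preemptible */
	memcpy(buf, trace->ip + skip, trace_nr * elem_size);

	/* trace/ips must not be dereferenced after this point */
	if (!trace_in)
		preempt_enable();

	/* faulting and slow build ID resolution only happen after the copy */
	if (user_build_id)
		stack_map_get_build_id_offset(buf, trace_nr, user, may_fault);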


Hi Andrii,

Building upon your work, I have also added preempt_disable() to bpf_get_stackid and tried to reduce the length of the preempt-disabled section.
Please review, thanks.

https://lore.kernel.org/bpf/20260206090653.1336687-1-chen.dylane@xxxxxxxxx
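
For readers skimming the thread, the shape of that change is roughly the
following sketch fragment; it is only an approximation, the real diff is
in the series linked above:

	/* in bpf_get_stackid(), approximately: */
	preempt_disable();
	trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
	if (unlikely(!trace)) {
		preempt_enable();
		return -EFAULT;
	}
	ret = __bpf_get_stackid(map, trace, flags);
	preempt_enable();
	return ret;

	/*
	 * The follow-up question is how much of __bpf_get_stackid() really
	 * needs to stay inside the preempt-off region.
	 */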

Am I missing something?

Yes, it looks good for bpf_get_stack, and I also proposed a similar
change previously. But Alexei suggested that we should reduce the
preemption-disabled section in bpf_get_stackid if we do the same as in
bpf_get_stack. Maybe we can change bpf_get_stack first?

https://lore.kernel.org/bpf/20250922075333.1452803-1-chen.dylane@xxxxxxxxx/

This is broken because we are still using trace after you re-enabled
preemption. We need to keep preemption disabled until we copy captured
stack trace data from ips into our own memory.



[...]


--
Best Regards
Tao Chen


