Re: [PATCH v6 3/5] tracing/bpf-trace: Add support for faultable tracepoints

From: Andrii Nakryiko
Date: Mon Sep 09 2024 - 12:53:31 EST


On Mon, Sep 9, 2024 at 8:11 AM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> On 2024-09-04 21:21, Andrii Nakryiko wrote:
> > On Wed, Aug 28, 2024 at 7:42 AM Mathieu Desnoyers
> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >>
> >> In preparation for converting system call enter/exit instrumentation
> >> into faultable tracepoints, make sure that bpf can handle registering to
> >> such tracepoints by explicitly disabling preemption within the bpf
> >> tracepoint probes to respect the current expectations within bpf tracing
> >> code.
> >>
> >> This change does not yet allow bpf to take page faults per se within its
> >> probe, but allows its existing probes to connect to faultable
> >> tracepoints.
> >>
> >> Link: https://lore.kernel.org/lkml/20231002202531.3160-1-mathieu.desnoyers@xxxxxxxxxxxx/
> >> Co-developed-by: Michael Jeanson <mjeanson@xxxxxxxxxxxx>
> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx>
> >> Signed-off-by: Michael Jeanson <mjeanson@xxxxxxxxxxxx>
> >> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
> >> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> >> Cc: Masami Hiramatsu <mhiramat@xxxxxxxxxx>
> >> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >> Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
> >> Cc: Yonghong Song <yhs@xxxxxx>
> >> Cc: Paul E. McKenney <paulmck@xxxxxxxxxx>
> >> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> >> Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
> >> Cc: Mark Rutland <mark.rutland@xxxxxxx>
> >> Cc: Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>
> >> Cc: Namhyung Kim <namhyung@xxxxxxxxxx>
> >> Cc: bpf@xxxxxxxxxxxxxxx
> >> Cc: Joel Fernandes <joel@xxxxxxxxxxxxxxxxx>
> >> ---
> >> Changes since v4:
> >> - Use DEFINE_INACTIVE_GUARD.
> >> - Add brackets to multiline 'if' statements.
> >> Changes since v5:
> >> - Rebased on v6.11-rc5.
> >> - Pass the TRACEPOINT_MAY_FAULT flag directly to tracepoint_probe_register_prio_flags.
> >> ---
> >> include/trace/bpf_probe.h | 21 ++++++++++++++++-----
> >> kernel/trace/bpf_trace.c | 2 +-
> >> 2 files changed, 17 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> >> index a2ea11cc912e..cc96dd1e7c3d 100644
> >> --- a/include/trace/bpf_probe.h
> >> +++ b/include/trace/bpf_probe.h
> >> @@ -42,16 +42,27 @@
> >> /* tracepoints with more than 12 arguments will hit build error */
> >> #define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
> >>
> >> -#define __BPF_DECLARE_TRACE(call, proto, args) \
> >> +#define __BPF_DECLARE_TRACE(call, proto, args, tp_flags) \
> >> static notrace void \
> >> __bpf_trace_##call(void *__data, proto) \
> >> { \
> >> - CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
> >> + DEFINE_INACTIVE_GUARD(preempt_notrace, bpf_trace_guard); \
> >> + \
> >> + if ((tp_flags) & TRACEPOINT_MAY_FAULT) { \
> >> + might_fault(); \
> >> + activate_guard(preempt_notrace, bpf_trace_guard)(); \
> >> + } \
> >> + \
> >> + CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args)); \
> >> }
> >>
> >> #undef DECLARE_EVENT_CLASS
> >> #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> >> - __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
> >> + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0)
> >> +
> >> +#undef DECLARE_EVENT_CLASS_MAY_FAULT
> >> +#define DECLARE_EVENT_CLASS_MAY_FAULT(call, proto, args, tstruct, assign, print) \
> >> + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), TRACEPOINT_MAY_FAULT)
> >>
> >> /*
> >> * This part is compiled out, it is only here as a build time check
> >> @@ -105,13 +116,13 @@ static inline void bpf_test_buffer_##call(void) \
> >>
> >> #undef DECLARE_TRACE
> >> #define DECLARE_TRACE(call, proto, args) \
> >> - __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
> >> + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \
> >> __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)
> >>
> >> #undef DECLARE_TRACE_WRITABLE
> >> #define DECLARE_TRACE_WRITABLE(call, proto, args, size) \
> >> __CHECK_WRITABLE_BUF_SIZE(call, PARAMS(proto), PARAMS(args), size) \
> >> - __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
> >> + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args), 0) \
> >> __DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), size)
> >>
> >> #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
> >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> >> index c77eb80cbd7f..ed07283d505b 100644
> >> --- a/kernel/trace/bpf_trace.c
> >> +++ b/kernel/trace/bpf_trace.c
> >> @@ -2473,7 +2473,7 @@ int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *li
> >>
> >> return tracepoint_probe_register_prio_flags(tp, (void *)btp->bpf_func,
> >> link, TRACEPOINT_DEFAULT_PRIO,
> >> - TRACEPOINT_MAY_EXIST);
> >> + TRACEPOINT_MAY_EXIST | (tp->flags & TRACEPOINT_MAY_FAULT));
> >> }
> >>
> >> int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_raw_tp_link *link)
> >> --
> >> 2.39.2
> >>
> >>
> >
> > I wonder if it would be better to just do this instead of that
> > preempt guard. I don't think we strictly need preemption to be
> > disabled; we just need to stay on the same CPU, as we already do for
> > many other program types.
>
> I'm worried about introducing any kind of subtle synchronization
> change in this series, and moving from preempt-off to migrate-disable
> definitely falls under that umbrella.
>
> I would recommend auditing all uses of the this_cpu_*() APIs to make
> sure accesses to per-cpu data structures use atomics, rather than
> operations that rely on preempt-off to keep concurrent threads on the
> same CPU from updating the per-cpu data.
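
(For concreteness, this is the sort of pattern such an audit would look
for: a minimal sketch using a made-up per-CPU counter, not anything
taken from bpf_trace.c.)

/* hypothetical per-CPU nesting counter, for illustration only */
static DEFINE_PER_CPU(int, bpf_probe_nest);

static void nest_inc_assumes_preempt_off(void)
{
	/*
	 * Plain read-modify-write through a raw per-CPU pointer: only
	 * safe against other tasks on this CPU because the caller runs
	 * with preemption disabled.
	 */
	int *nest = this_cpu_ptr(&bpf_probe_nest);

	(*nest)++;
}

static void nest_inc_migrate_disable_only(void)
{
	/*
	 * With only migrate_disable(), a preempting task on the same CPU
	 * can interleave the load and store above, so the preempt-safe
	 * accessor is needed instead.
	 */
	this_cpu_inc(bpf_probe_nest);
}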
>
> So what you are suggesting may be a good idea, but I prefer to leave
> this kind of change to a separate bpf-specific series, and I would
> leave this work to someone who knows more about ebpf than me.
>

Yeah, that's ok. The migrate_disable() switch is probably going a bit
too far, too fast, but I think we should just add
preempt_disable()/preempt_enable() inside __bpf_trace_run() instead of
leaving it inside those hard-to-find-and-follow tracepoint macros. So
maybe you can just pass a bool into __bpf_trace_run() and do the
preempt guard (or an explicit disable/enable) there?
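
Something along these lines, roughly (an untested sketch; the may_fault
parameter name and the notrace preempt accessors are assumptions, and
the middle of the function is elided):

static __always_inline void
__bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args, bool may_fault)
{
	struct bpf_prog *prog = link->link.prog;

	if (may_fault) {
		/* Faultable tracepoint: sleeping was allowed up to here. */
		might_fault();
		preempt_disable_notrace();
	} else {
		cant_sleep();
	}

	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
		bpf_prog_inc_misses_counter(prog);
		goto out;
	}

	/* ... existing run_ctx setup and bpf_prog_run() call unchanged ... */

out:
	this_cpu_dec(*(prog->active));

	if (may_fault)
		preempt_enable_notrace();
}

The bpf_trace_run##N() helpers (or the __BPF_DECLARE_TRACE() macro)
would then only have to forward the flag instead of open-coding the
guard.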

> Thanks,
>
> Mathieu
>
> >
> > We'll need some more BPF-specific plumbing to fully support faultable
> > (sleepable) tracepoints, but this should unblock your work, unless I'm
> > missing something. And we can take it from there, once your patches
> > land, to take advantage of faultable tracepoints in the BPF ecosystem.
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index b69a39316c0c..415639b7c7a4 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -2302,7 +2302,8 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
> >  	struct bpf_run_ctx *old_run_ctx;
> >  	struct bpf_trace_run_ctx run_ctx;
> >
> > -	cant_sleep();
> > +	migrate_disable();
> > +
> >  	if (unlikely(this_cpu_inc_return(*(prog->active)) != 1)) {
> >  		bpf_prog_inc_misses_counter(prog);
> >  		goto out;
> > @@ -2318,6 +2319,8 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args)
> >  	bpf_reset_run_ctx(old_run_ctx);
> >  out:
> >  	this_cpu_dec(*(prog->active));
> > +
> > +	migrate_enable();
> >  }
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>