Re: Can we switch the tracepoints from preempt protection to rcu_read_lock?
From: Mathieu Desnoyers
Date: Tue Dec 10 2024 - 14:03:57 EST
On 2024-12-06 12:07, Steven Rostedt wrote:
> Hi Mathieu,
Hi Steven,
> Sebastian brought up a point at our RT Stable meeting. BPF hooks into
> tracepoints and can cause long latency on RT setups.
Indeed, as expected if BPF doesn't have BPF hook duration validation in
place.
> IIRC, tracepoints themselves do not need to have preemption disabled. It's
> just that some of the users of tracepoints expect preemption to be disabled.
Correct. Tracepoints need some means of synchronizing callback
iteration with callback registration/unregistration (RCU). Which
flavor is used depends on the constraints of the execution contexts
in which tracepoints are inserted.

The fact that tracer probe functions expect preemption to be disabled
when called is merely a consequence of the current tracepoint
implementation, but this contract between tracepoints and tracers can
evolve as needed.
> If we fix the users of tracepoints not to expect preemption to be disabled,
> then we could just switch the preempt_disable code (guard(preempt)) to
> rcu_read_lock()s for the tracepoint callbacks, right?
There are a few things to consider here about the constraints of the
callsites where the tracepoints are inserted. In general, those need to
be:
- NMI-safe
- notrace
- usable from the scheduler (with rq lock held)
- usable to trace the RCU implementation
Hence the use of guard(preempt_notrace)().
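
To make this concrete, here is a stripped-down sketch of what the
generated tracepoint invocation does today (paraphrased from
__DO_TRACE()/__traceiter_*() in include/linux/tracepoint.h, with casts
and details elided, so not verbatim kernel code):

	struct tracepoint_func *it_func_ptr;
	void (*it_func)(void *data, unsigned long arg);	/* example probe prototype */

	guard(preempt_notrace)();	/* synchronizes probe iteration vs. unregistration */
	it_func_ptr = rcu_dereference_raw(tp->funcs);	/* RCU-published {func, data} array */
	if (it_func_ptr) {
		do {
			it_func = (void *)READ_ONCE(it_func_ptr->func);
			it_func(it_func_ptr->data, arg);	/* call probe with its private data */
		} while ((++it_func_ptr)->func);
	}
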
So replacing this by a rcu_read_lock() would lose the "notrace", which
may break some users.
Other than that, I see that the PREEMPT_RCU implementation of
rcu_read_lock/unlock works pretty much like the urcu-mb flavor of
liburcu:
static void rcu_preempt_read_enter(void)
{
	/* Bump the task-local nesting counter: no atomics, no barriers. */
	WRITE_ONCE(current->rcu_read_lock_nesting, READ_ONCE(current->rcu_read_lock_nesting) + 1);
}

static int rcu_preempt_read_exit(void)
{
	/* Decrement the nesting counter and return the new depth. */
	int ret = READ_ONCE(current->rcu_read_lock_nesting) - 1;

	WRITE_ONCE(current->rcu_read_lock_nesting, ret);
	return ret;
}
Technically this was designed to be async-signal safe in userspace, so
I expect this to work in NMI context.
I suspect that the main thing we may be missing here is a
rcu_read_lock/unlock_notrace pair that, similarly to our use of
preempt_disable/enable_notrace, does not call into instrumented code
from the instrumentation.
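
Roughly something like this (a sketch only: rcu_read_lock/unlock_notrace
do not exist today, and a real rcu_read_unlock_notrace() would also have
to handle the deferred quiescent-state work currently done through
rcu_read_unlock_special()):

	/*
	 * Hypothetical notrace variants for PREEMPT_RCU: they must not call
	 * into lockdep or any other instrumented code.
	 */
	static __always_inline notrace void rcu_read_lock_notrace(void)
	{
		rcu_preempt_read_enter();
		barrier();	/* critical section stays after the increment */
	}

	static __always_inline notrace void rcu_read_unlock_notrace(void)
	{
		barrier();	/* critical section stays before the decrement */
		if (rcu_preempt_read_exit() == 0) {
			/*
			 * Outermost exit: a real implementation still needs to
			 * check for deferred quiescent-state reporting here.
			 */
		}
	}
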
> There are one or two places in ftrace that expect it, but I don't know
> enough about perf. I don't think BPF needs preemption disabled, but just
> migration disabled. I know you had some patches to work around this.
Correct, BPF needs migration disabled AFAIU. Perf/ftrace/lttng would have to
explicitly disable preemption within their callbacks, but that's easily
fixable.
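
For instance (a sketch; the probe names and arguments are made up):

	/*
	 * What a tracer probe could do once the tracepoint core only takes
	 * the RCU read lock: perf/ftrace/lttng keep preemption off themselves.
	 */
	static notrace void my_tracer_probe(void *data, unsigned long arg)
	{
		guard(preempt_notrace)();
		/* ... record the event into a per-CPU buffer ... */
	}

	/*
	 * BPF only needs to stay on the same CPU, so disabling migration
	 * should be enough there.
	 */
	static notrace void my_bpf_probe(void *data, unsigned long arg)
	{
		migrate_disable();
		/* ... run the BPF program ... */
		migrate_enable();
	}
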
> We need to get BPF working without preemption disabled for RT, I'm not sure
> how much you know about what needs to be fixed.
Well, the first step would be to introduce rcu_read_lock/unlock_notrace.
This solves the problem at the tracepoint level, but requires that we
initially move the preempt disable to the tracer callbacks. Then we
can figure out within each tracer what needs to be done to further
reduce the preempt off critical section.
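
The tracepoint fast path from the earlier sketch would then end up
looking roughly like this (again using the hypothetical notrace RCU
primitives):

	rcu_read_lock_notrace();	/* instead of guard(preempt_notrace)() */
	it_func_ptr = rcu_dereference_raw(tp->funcs);
	if (it_func_ptr) {
		do {
			it_func = (void *)READ_ONCE(it_func_ptr->func);
			it_func(it_func_ptr->data, arg);
		} while ((++it_func_ptr)->func);
	}
	rcu_read_unlock_notrace();
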
> I'm not asking for you to do this work, but can you remind me what you saw
> when you created the faultable tracepoints?
I saw the future! ;-)
Thanks,
Mathieu
> Thanks,
> -- Steve
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com