Re: Instrumentation and RCU

From: Mathieu Desnoyers
Date: Mon Mar 09 2020 - 14:37:44 EST


----- On Mar 9, 2020, at 1:02 PM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:

> Folks,
>
> I'm starting a new conversation because there are about 20 different
> threads which look at that problem in various ways and the information
> is so scattered that creating a coherent picture is pretty much
> impossible.
>
> There are several problems to solve:
>
> 1) Fragile low level entry code
>
> 2) Breakpoint utilization
>
> 3) RCU idle
>
> 4) Callchain protection
>
> #1 Fragile low level entry code
>
> While I understand the desire of instrumentation to observe
> everything we really have to ask the question whether it is worth the
> trouble especially with entry trainwrecks like x86, PTI and other
> horrors in that area.
>
> I don't think so and we really should just bite the bullet and forbid
> any instrumentation in that code unless it is explicitly designed
> for that case, makes sense and has a real value from an observation
> perspective.
>
> This is very much related to #3..

Do I understand correctly that you intend to move all kernel low-level
entry/exit code into sections which cannot be instrumented by kprobes or
the function tracer, and to require explicit whitelisting, either through
annotations or through explicit tracepoints?

>
> #2) Breakpoint utilization
>
> As recent findings have shown, breakpoint utilization needs to be
> extremely careful about not creating infinite breakpoint recursions.
>
> I think that's pretty much obvious, but falls into the overall
> question of how to protect callchains.

I think another question arises here: the lack of automated, continuous
testing of kprobes coverage. We have tested various random permutations of
kprobes instrumentation, and have succeeded in crashing the kernel in
several ways. Unfortunately, that testing is not done on a continuous
basis, and maintainers understandably have little spare time to play the
whack-a-mole game of adding missing nokprobes annotations as the kernel
code evolves.

>
> #3) RCU idle
>
> Being able to trace code inside RCU idle sections is very similar to
> the question raised in #1.
>
> Assume all of the instrumentation would be doing conditional RCU
> schemes, i.e.:
>
> if (rcuidle)
> ....
> else
> rcu_read_lock_sched()
>
> before invoking the actual instrumentation functions and of course
> undoing that right after it, that really begs the question whether
> it's worth it.
>
> Especially constructs like:
>
> trace_hardirqs_off()
> idx = srcu_read_lock()
> rcu_irq_enter_irqson();
> ...
> rcu_irq_exit_irqson();
> srcu_read_unlock(idx);
>
> if (user_mode)
> user_exit_irqsoff();
> else
> rcu_irq_enter();
>
> are really more than questionable. For 99.9999% of instrumentation
> users it's absolutely irrelevant whether this traces the interrupt
> disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
>
> But what's relevant is the tracer overhead which is e.g. inflicted
> with todays trace_hardirqs_off/on() implementation because that
> unconditionally uses the rcuidle variant with the srcu/rcu_irq dance
> around every tracepoint.

I think one of the big issues here is that most of the uses of
trace_hardirqs_off() are from sites which already have RCU watching,
so we are doing heavy-weight operations for nothing.

I strongly suspect that most of the overhead we've been trying to avoid when
introducing use of SRCU in rcuidle tracepoints was actually caused by callsites
which use rcuidle tracepoints while having RCU watching, just because there is a
handful of callsites which don't have RCU watching. This is confirmed
by the commit message of commit e6753f23d9 "tracepoint: Make rcuidle
tracepoint callers use SRCU":

"In recent tests with IRQ on/off tracepoints, a large performance
overhead ~10% is noticed when running hackbench. This is root caused to
calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
tracepoint code. Following a long discussion on the list [1] about this,
we concluded that srcu is a better alternative for use during rcu idle.
Although it does involve extra barriers, its lighter than the sched-rcu
version which has to do additional RCU calls to notify RCU idle about
entry into RCU sections.
[...]
Test: Tested idle and preempt/irq tracepoints."

So I think we could go back to plain RCU for rcuidle tracepoints if we do
the cheaper "rcu_is_watching()" check rather than invoking
rcu_irq_{enter,exit}_irqson() unconditionally.

> Even if the tracepoint sits in the ASM code it just covers about ~20
> low level ASM instructions more. The tracer invocation, which is
> even done twice when coming from user space on x86 (the second call
> is optimized in the tracer C-code), costs definitely way more
> cycles. When you take the srcu/rcu_irq dance into account it's a
> complete disaster performance wise.

Part of the issue here is the current overhead of the SRCU read-side lock,
which contains memory barriers. The other part is that
rcu_irq_{enter,exit}_irqson() performs an atomic_add_return instruction.

We could use the approach proposed in PeterZ's and Steven's patches: do a
lightweight "rcu_is_watching()" check for rcuidle tracepoints, and only enable
RCU for those cases. We could then simply go back to using regular RCU, like so:

#define __DO_TRACE(tp, proto, args, cond, rcuidle)			\
	do {								\
		struct tracepoint_func *it_func_ptr;			\
		void *it_func;						\
		void *__data;						\
		bool exit_rcu = false;					\
									\
		if (!(cond))						\
			return;						\
									\
		if (rcuidle && !rcu_is_watching()) {			\
			rcu_irq_enter_irqson();				\
			exit_rcu = true;				\
		}							\
		preempt_disable_notrace();				\
		it_func_ptr = rcu_dereference_raw((tp)->funcs);		\
		if (it_func_ptr) {					\
			do {						\
				it_func = (it_func_ptr)->func;		\
				__data = (it_func_ptr)->data;		\
				((void(*)(proto))(it_func))(args);	\
			} while ((++it_func_ptr)->func);		\
		}							\
		preempt_enable_notrace();				\
		if (exit_rcu)						\
			rcu_irq_exit_irqson();				\
	} while (0)



> #4 Protecting call chains
>
> Our current approach of annotating functions with notrace/noprobe is
> pretty much broken.
>
> Functions which are marked NOPROBE or notrace call out into functions
> which are not marked and while this might be ok, there are enough
> places where it is not. But we have no way to verify that.
>
> That's just a recipe for disaster. We really cannot request from
> sysadmins who want to use instrumentation to stare at the code first
> whether they can place/enable an instrumentation point somewhere.
> That'd be just a bad joke.
>
> I really think we need to have proper text sections which are off
> limit for any form of instrumentation and have tooling to analyze the
> calls into other sections. These calls need to be annotated as safe
> and intentional.

If we go all the way in that direction, I suspect it might even make sense
to duplicate some kernel functions: keep the original as code which can be
instrumented, but provide a tracing-specific copy hidden from
instrumentation. I wonder what the size cost of this duplication would be.

In addition to splitting tracing code into a separate section, which I think
makes sense, I can think of a complementary way to provide call-chain
protection: adding an "in_tracing" flag alongside each kernel stack.
Each thread and interrupt stack would have its own flag, but trap handlers
would share the "in_tracing" flag of the context which triggered the trap.

If a tracer recurses, or attempts to trace another tracer, the flag would
break the recursion chain by preventing the nested instrumentation from
firing. If we end up caring about tracers tracing other tracers, we could
have one distinct flag per tracer and let each tracer break only its own
recursion chain.

Having this flag per kernel stack, rather than per CPU or per thread, would
still allow tracing of nested interrupt handlers (and NMIs), while breaking
recursion chains both within the same stack and across a trap. I think
it could be a nice complementary safety net to handle mishaps in a non-fatal
way.

Thanks,

Mathieu

>
> Thoughts?
>
> Thanks,
>
> tglx

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com