Re: Instrumentation and RCU

From: Thomas Gleixner
Date: Mon Mar 09 2020 - 15:52:34 EST

Next message: Bernd Edlinger: "Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec"
Previous message: Tejun Heo: "Re: [PATCH v2 1/4] kernfs: kvmalloc xattr value instead of kmalloc"
In reply to: Mathieu Desnoyers: "Re: Instrumentation and RCU"
Next in thread: Mathieu Desnoyers: "Re: Instrumentation and RCU"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> writes:
> ----- On Mar 9, 2020, at 1:02 PM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:
>> #1 Fragile low level entry code
>>
>> While I understand the desire of instrumentation to observe
>> everything we really have to ask the question whether it is worth the
>> trouble especially with entry trainwrecks like x86, PTI and other
>> horrors in that area.
>>
>> I don't think so and we really should just bite the bullet and forbid
>> any instrumentation in that code unless it is explicitly designed
>> for that case, makes sense and has a real value from an observation
>> perspective.
>>
>> This is very much related to #3..
>
> Do I understand correctly that you intend on moving all kernel low level
> entry/exit code into sections which cannot be instrumented by kprobes nor
> the function tracer, and require explicit whitelisting, either through
> annotations or use of explicit tracepoints ?

Pretty much so.

>> #2) Breakpoint utilization
>>
>> As recent findings have shown, breakpoint utilization needs to be
>> extremly careful about not creating infinite breakpoint recursions.
>>
>> I think that's pretty much obvious, but falls into the overall
>> question of how to protect callchains.
>
> I think there is another question that arises here: the lack of automated
> continuous testing of the kprobes coverage. We have performed some testing of
> various random permutations of kprobes instrumentation, and have succeeded in
> crashing the kernel in various ways. Unfortunately, that testing is not done
> on a continuous basis, and maintainers understandably have little spare time
> to play the whack-a-mole game of adding missing nokprobes annotations as
> the kernel code evolves.

That's why I think a section approach is more practicable.

>> #3) RCU idle
>> are really more than questionable. For 99.9999% of instrumentation
>> users it's absolutely irrelevant whether this traces the interrupt
>> disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
>>
>> But what's relevant is the tracer overhead which is e.g. inflicted
>> with todays trace_hardirqs_off/on() implementation because that
>> unconditionally uses the rcuidle variant with the scru/rcu_irq dance
>> around every tracepoint.
>
> I think one of the big issues here is that most of the uses of
> trace_hardirqs_off() are from sites which already have RCU watching,
> so we are doing heavy-weight operations for nothing.

That and in some places in the entry code we do the heavy weight
operations to cover 100 instructions which turn on RCU anyway. That does
not make any sense at all.

> I strongly suspect that most of the overhead we've been trying to avoid when
> introducing use of SRCU in rcuidle tracepoints was actually caused by callsites
> which use rcuidle tracepoints while having RCU watching, just because there is a
> handful of callsites which don't have RCU watching. This is confirmed
> by the commit message of commit e6753f23d9 "tracepoint: Make rcuidle
> tracepoint callers use SRCU":
>
> "In recent tests with IRQ on/off tracepoints, a large performance
> overhead ~10% is noticed when running hackbench. This is root caused to
> calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
> tracepoint code. Following a long discussion on the list [1] about this,
> we concluded that srcu is a better alternative for use during rcu idle.
> Although it does involve extra barriers, its lighter than the sched-rcu
> version which has to do additional RCU calls to notify RCU idle about
> entry into RCU sections.
> [...]
> Test: Tested idle and preempt/irq tracepoints."

In a quick test I did with a invalid syscall number with profiling the
trace_hardirqs_off() is pretty prominent and goes down by roughly a
factor of 2 when I move it past enter_from_user_mode() and use just the
non RCU idle variant.

> So I think we could go back to plain RCU for rcuidle tracepoints if we do
> the cheaper "rcu_is_watching()" check rather than invoking
> rcu_irq_{enter,exit}_irqson() unconditionally.

Haven't tried that yet for the overall usage of trace_hardirqs_off(),
but yes, it's going to be a measurable difference.

>> Even if the tracepoint sits in the ASM code it just covers about ~20
>> low level ASM instructions more. The tracer invocation, which is
>> even done twice when coming from user space on x86 (the second call
>> is optimized in the tracer C-code), costs definitely way more
>> cycles. When you take the scru/rcu_irq dance into account it's a
>> complete disaster performance wise.
>
> Part of the issue here is the current overhead of SRCU read-side lock,
> which contains memory barriers. The other part of the issue is the fact that
> rcu_irq_{enter,exit}_irqson() contains an atomic_add_return atomic instruction.
>
> We could use the approach proposed by Peterz's and Steven's patches to basically
> do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only enable
> RCU for those cases. We could then simply go back on using regular RCU
> like so:

Right, but that still does the whole rcu_irq dance especially in the
entry code just to trace 50 or 100 instructions which are turning on RCU
anyway.

>> #4 Protecting call chains
>>
>> Our current approach of annotating functions with notrace/noprobe is
>> pretty much broken.
>>
>> Functions which are marked NOPROBE or notrace call out into functions
>> which are not marked and while this might be ok, there are enough
>> places where it is not. But we have no way to verify that.
>>
>> That's just a recipe for disaster. We really cannot request from
>> sysadmins who want to use instrumentation to stare at the code first
>> whether they can place/enable an instrumentation point somewhere.
>> That'd be just a bad joke.
>>
>> I really think we need to have proper text sections which are off
>> limit for any form of instrumentation and have tooling to analyze the
>> calls into other sections. These calls need to be annotated as safe
>> and intentional.
>
> If we go all the way into that direction, I suspect it might even make sense
> to duplicate some kernel functions so they can still be part of the code which
> can be instrumented, but provide tracing-specific copy which would be hidden
> from instrumentation. I wonder what would be the size cost of this
> duplication.

That might happen, but we'll see how much of that is truly
requried. It's not horrible large. For the int3 isolation most of it was
fixing up code which should have been protected in the first place and
the only copied code was bsearch which is tiny.

> In addition to splitting tracing code into a separate section, which I think
> makes sense, I can think of another alternative way to provide call chains
> protection: adding a "in_tracing" flag somewhere alongside each kernel stack.
> Each thread and interrupt stack would have its own flag. However, trap handlers
> should share the "in_tracing" flag with the context which triggers the trap.
>
> If a tracer recurses, or if a tracer attempts to trace another tracer, the
> instrumentation would break the recursion chain by preventing instrumentation
> from firing. If we end up caring about tracers tracing other tracers, we could
> have one distinct flag per tracer and let each tracer break the recursion chain.
>
> Having this flag per kernel stack rather than per CPU or per thread would
> allow tracing of nested interrupt handlers (and NMIs), but would break
> call chains both within the same stack or going through a trap. I think
> it could be a nice complementary safety net to handle mishaps in a non-fatal
> way.

That works as long as none of this uses breakpoint based patching to
dynamically disable/enable stuff.

Thanks,

tglx

Next message: Bernd Edlinger: "Re: [PATCH v2 4/5] exec: Move exec_mmap right after de_thread in flush_old_exec"
Previous message: Tejun Heo: "Re: [PATCH v2 1/4] kernfs: kvmalloc xattr value instead of kmalloc"
In reply to: Mathieu Desnoyers: "Re: Instrumentation and RCU"
Next in thread: Mathieu Desnoyers: "Re: Instrumentation and RCU"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]