Re: Instrumentation and RCU

From: Mathieu Desnoyers
Date: Tue Mar 10 2020 - 11:03:19 EST


----- On Mar 9, 2020, at 3:52 PM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:

> Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> writes:
>> ----- On Mar 9, 2020, at 1:02 PM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:
[...]
>
>>> #3) RCU idle
>>> are really more than questionable. For 99.9999% of instrumentation
>>> users it's absolutely irrelevant whether this traces the interrupt
>>> disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
>>>
>>> But what's relevant is the tracer overhead which is e.g. inflicted
>>> with today's trace_hardirqs_off/on() implementation because that
>>> unconditionally uses the rcuidle variant with the srcu/rcu_irq dance
>>> around every tracepoint.
>>
>> I think one of the big issues here is that most of the uses of
>> trace_hardirqs_off() are from sites which already have RCU watching,
>> so we are doing heavy-weight operations for nothing.
>
> That and in some places in the entry code we do the heavy weight
> operations to cover 100 instructions which turn on RCU anyway. That does
> not make any sense at all.
>
>> I strongly suspect that most of the overhead we've been trying to avoid when
>> introducing use of SRCU in rcuidle tracepoints was actually caused by callsites
>> which use rcuidle tracepoints while having RCU watching, just because there is a
>> handful of callsites which don't have RCU watching. This is confirmed
>> by the commit message of commit e6753f23d9 "tracepoint: Make rcuidle
>> tracepoint callers use SRCU":
>>
>> "In recent tests with IRQ on/off tracepoints, a large performance
>> overhead ~10% is noticed when running hackbench. This is root caused to
>> calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
>> tracepoint code. Following a long discussion on the list [1] about this,
>> we concluded that srcu is a better alternative for use during rcu idle.
>> Although it does involve extra barriers, its lighter than the sched-rcu
>> version which has to do additional RCU calls to notify RCU idle about
>> entry into RCU sections.
>> [...]
>> Test: Tested idle and preempt/irq tracepoints."
>
> In a quick test I did with an invalid syscall number with profiling, the
> trace_hardirqs_off() is pretty prominent and goes down by roughly a
> factor of 2 when I move it past enter_from_user_mode() and use just the
> non-RCU-idle variant.

I think one issue here is that trace_hardirqs_off() is now shared between
lockdep and tracing. For lockdep, we have the following comment:

/*
* IRQ from user mode.
*
* We need to tell lockdep that IRQs are off. We can't do this until
* we fix gsbase, and we should do it before enter_from_user_mode
* (which can take locks). Since TRACE_IRQS_OFF is idempotent,
* the simplest way to handle it is to just call it twice if
* we enter from user mode. There's no reason to optimize this since
* TRACE_IRQS_OFF is a no-op if lockdep is off.
*/
TRACE_IRQS_OFF

CALL_enter_from_user_mode

1:
ENTER_IRQ_STACK old_rsp=%rdi save_ret=1
/* We entered an interrupt context - irqs are off: */
TRACE_IRQS_OFF

which seems to imply that lockdep requires TRACE_IRQS_OFF to be performed
_before_ entering from user mode. I don't expect this to be useful at all for
other tracers though. I think this should be replaced by a new macro, e.g.
LOCKDEP_ENTER_FROM_USER_MODE, which would call into lockdep without invoking
any other tracers.
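
To make this concrete, here is a minimal sketch of what such a lockdep-only
annotation could look like (the wrapper name is hypothetical; the idea is just
to call the existing lockdep hook directly instead of going through
trace_hardirqs_off() and its rcuidle tracepoint):

/*
 * Illustrative sketch only, not an actual kernel API.
 * lockdep_hardirqs_off() is the lockdep hook that trace_hardirqs_off()
 * ends up calling today; invoking it directly skips the irq_disable
 * tracepoint and therefore the whole rcuidle/SRCU machinery.
 */
static __always_inline void lockdep_enter_from_user_mode(void)
{
	lockdep_hardirqs_off(CALLER_ADDR0);
}

The entry ASM would then call this helper (or an equivalent
LOCKDEP_ENTER_FROM_USER_MODE asm macro) in place of the first TRACE_IRQS_OFF
above, leaving the tracer-visible TRACE_IRQS_OFF to run once RCU is watching.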

>
>> So I think we could go back to plain RCU for rcuidle tracepoints if we do
>> the cheaper "rcu_is_watching()" check rather than invoking
>> rcu_irq_{enter,exit}_irqson() unconditionally.
>
> Haven't tried that yet for the overall usage of trace_hardirqs_off(),
> but yes, it's going to be a measurable difference.

Ideally we should do both: replace the TRACE_IRQS_OFF done prior to entering
from user mode with a lockdep-specific call, which would remove the frequent
case that requires temporarily enabling RCU, *and* change rcuidle tracepoints
to check rcu_is_watching(), temporarily enable RCU only when it is not already
watching, and use standard RCU rather than SRCU.
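
For the second part, the rcuidle tracepoint path could roughly look like the
following sketch (not the actual __DO_TRACE implementation; the macro name and
structure are illustrative, only meant to show that the rcu_irq dance is paid
for solely when RCU is not already watching):

#define __DO_TRACE_RCUIDLE_SKETCH(invoke_probes)			\
	do {								\
		bool __watching = rcu_is_watching();			\
									\
		if (!__watching)					\
			rcu_irq_enter_irqson();	/* make RCU watch us */	\
		rcu_read_lock_sched_notrace();				\
		invoke_probes;		/* call the registered probes */\
		rcu_read_unlock_sched_notrace();			\
		if (!__watching)					\
			rcu_irq_exit_irqson();				\
	} while (0)

For the vast majority of callsites, where RCU is already watching, this
reduces the rcuidle overhead to a single rcu_is_watching() check plus a plain
preempt disable/enable pair.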

>
>>> Even if the tracepoint sits in the ASM code it just covers about ~20
>>> low level ASM instructions more. The tracer invocation, which is
>>> even done twice when coming from user space on x86 (the second call
>>> is optimized in the tracer C-code), costs definitely way more
>>> cycles. When you take the srcu/rcu_irq dance into account it's a
>>> complete disaster performance wise.
>>
>> Part of the issue here is the current overhead of SRCU read-side lock,
>> which contains memory barriers. The other part of the issue is the fact that
>> rcu_irq_{enter,exit}_irqson() contains an atomic_add_return atomic instruction.
>>
>> We could use the approach proposed by Peterz's and Steven's patches to basically
>> do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only
>> enable
>> RCU for those cases. We could then simply go back on using regular RCU
>> like so:
>
> Right, but that still does the whole rcu_irq dance especially in the
> entry code just to trace 50 or 100 instructions which are turning on RCU
> anyway.

Agreed. Would changing this to a lockdep-specific call, as I suggest above,
solve this?

>
>>> #4 Protecting call chains
[...]
>
>> In addition to splitting tracing code into a separate section, which I think
>> makes sense, I can think of another alternative way to provide call chains
>> protection: adding a "in_tracing" flag somewhere alongside each kernel stack.
>> Each thread and interrupt stack would have its own flag. However, trap handlers
>> should share the "in_tracing" flag with the context which triggers the trap.
>>
>> If a tracer recurses, or if a tracer attempts to trace another tracer, the
>> instrumentation would break the recursion chain by preventing instrumentation
>> from firing. If we end up caring about tracers tracing other tracers, we could
>> have one distinct flag per tracer and let each tracer break the recursion chain.
>>
>> Having this flag per kernel stack rather than per CPU or per thread would
>> allow tracing of nested interrupt handlers (and NMIs), but would break
>> call chains both within the same stack or going through a trap. I think
>> it could be a nice complementary safety net to handle mishaps in a non-fatal
>> way.
>
> That works as long as none of this uses breakpoint based patching to
> dynamically disable/enable stuff.

I'm clearly missing something here. I was expecting the "in_tracing" flag trick
to be able to fix the breakpoint recursion issue. What is the problem I'm
missing?

An additional point about splitting core code into separate non-instrumentable
sections: The part I'm concerned about is instrumentation of trap handlers.
_That_ will likely need to remain shared between the kernel and the tracers.
This is where I think the "in_tracing" flag approach would help us solve the
recursion issue without losing that important instrumentation coverage.
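
For reference, here is a minimal sketch of the recursion-breaking scheme I
have in mind (the per-stack accessor and field name are hypothetical; only
the structure of the check matters):

/*
 * Hypothetical per-kernel-stack flag.  this_stack_info() is an
 * illustrative accessor for whatever structure sits alongside each
 * thread, interrupt and NMI stack; trap handlers would share the flag
 * of the context that took the trap.
 */
static inline bool tracer_recursion_enter(void)
{
	struct stack_tracing_info *si = this_stack_info();

	if (si->in_tracing)
		return false;		/* break the recursion chain */
	si->in_tracing = true;
	return true;
}

static inline void tracer_recursion_exit(void)
{
	this_stack_info()->in_tracing = false;
}

A tracer (or the tracepoint glue around it) would bracket probe invocation
with these two helpers and simply bail out when the flag is already set,
turning recursion into a dropped event rather than a crash.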

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com