Re: [RFC v4 3/4] irqflags: Avoid unnecessary calls to trace_ if you can

From: Mathieu Desnoyers
Date: Mon Apr 23 2018 - 10:31:38 EST


----- On Apr 22, 2018, at 11:19 PM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx wrote:

> On Sun, Apr 22, 2018 at 06:14:18PM -0700, Joel Fernandes wrote:
>> On Fri, Apr 20, 2018 at 12:07 AM, Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
>> > Hi,
>> >
>> > Thanks Masami and Namhyung for the suggestions!
>> >
>> > On Wed, Apr 18, 2018 at 10:43 PM, Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>> >> On Wed, Apr 18, 2018 at 06:02:50PM +0900, Masami Hiramatsu wrote:
>> >>> On Mon, 16 Apr 2018 21:07:47 -0700
>> >>> Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
>> >>>
>> >>> > With TRACE_IRQFLAGS, we call trace_ API too many times. We don't need
>> >>> > to if local_irq_restore or local_irq_save didn't actually do anything.
>> >>> >
>> >>> > This gives around a 4% improvement in performance when doing the
>> >>> > following command: "time find / > /dev/null"
>> >>> >
>> >>> > Also it's best to avoid these calls where possible, since in this series,
>> >>> > the RCU code in tracepoint.h seems to call these quite a bit and I'd
>> >>> > like to keep this overhead low.
>> >>>
>> >>> Can we assume that the "flags" has only 1 bit irq-disable flag?
>> >>> Since it skips calling raw_local_irq_restore(flags); too,
>> >>
>> >> I don't know how much it impacts performance, but maybe we can have
>> >> an arch-specific config option, something like below?
>> >
>> > I am hoping the flags restoration is "cheap", but I haven't
>> > specifically measured its cost.
>> >
>> >>
>> >>
>> >>> if there is any state in the flags on any arch, it may change the
>> >>> result. In that case, we can do it as below (just skipping trace_hardirqs_*)
>> >>>
>> >>> int disabled = irqs_disabled();
>> >>
>> >> if (disabled == raw_irqs_disabled_flags(flags)) {
>> >> #ifndef CONFIG_ARCH_CAN_SKIP_NESTED_IRQ_RESTORE
>> >> raw_local_irq_restore(flags);
>> >> #endif
>> >> return;
>> >> }
>> >
>> > Hmm, somehow I feel this part should be written generically enough
>> > that it applies to all architectures (as a first step).
>> >
>> >>
>> >>>
>> >>> if (!raw_irqs_disabled_flags(flags) && disabled)
>> >>> trace_hardirqs_on();
>> >>>
>> >>> raw_local_irq_restore(flags);
>> >>>
>> >>> if (raw_irqs_disabled_flags(flags) && !disabled)
>> >>> trace_hardirqs_off();
>> >
>> > I like this idea since it's a good thing to do the flag restoration
>> > just to be safe and preserve the current behavior. Also my goal was
>> > to reduce the trace_ calls in this series, so it's probably better I
>> > just do as you're suggesting. I will do some experiments and make the
>> > changes for the next series.
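
For readers following the thread, the restore sequence suggested above can
be modeled in user space. This is only a rough sketch, not kernel code:
every model_* name, the single-bit MODEL_IRQ_OFF_BIT flag layout and the
call counters are hypothetical stand-ins for irqs_disabled(),
raw_irqs_disabled_flags(), raw_local_irq_restore() and trace_hardirqs_*().
It shows that the flags are always restored, while the trace hooks fire
only when the IRQ state actually changes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
typedef unsigned long model_flags_t;
#define MODEL_IRQ_OFF_BIT 0x1UL      /* assumption: one bit encodes "IRQs off" */

static bool model_irqs_off;          /* models the CPU's current IRQ state */
static int trace_on_calls;           /* counts modeled trace_hardirqs_on() calls */
static int trace_off_calls;          /* counts modeled trace_hardirqs_off() calls */

static bool model_irqs_disabled(void) { return model_irqs_off; }

static bool model_irqs_disabled_flags(model_flags_t flags)
{
	return (flags & MODEL_IRQ_OFF_BIT) != 0;
}

/* Models raw_local_irq_restore(): unconditionally applies the saved state. */
static void model_raw_irq_restore(model_flags_t flags)
{
	model_irqs_off = model_irqs_disabled_flags(flags);
}

static void model_trace_hardirqs_on(void)  { trace_on_calls++; }
static void model_trace_hardirqs_off(void) { trace_off_calls++; }

/* Put the model into a known state and clear the counters. */
static void model_reset(bool irqs_off)
{
	model_irqs_off = irqs_off;
	trace_on_calls = 0;
	trace_off_calls = 0;
}

/* The suggested sequence: always restore, trace only on a real transition. */
static void model_local_irq_restore(model_flags_t flags)
{
	bool disabled = model_irqs_disabled();

	if (!model_irqs_disabled_flags(flags) && disabled)
		model_trace_hardirqs_on();

	model_raw_irq_restore(flags);

	if (model_irqs_disabled_flags(flags) && !disabled)
		model_trace_hardirqs_off();
}

/* Nested case: IRQs are off and flags also say "off", so no hook should fire. */
static void model_check_nested_restore(void)
{
	model_reset(true);
	model_local_irq_restore(MODEL_IRQ_OFF_BIT);
	assert(trace_on_calls == 0 && trace_off_calls == 0);
	assert(model_irqs_disabled());
}
```

Calling model_check_nested_restore() exercises the nested case, which is
the one this patch cares about: the restore still happens but neither trace
hook is called.
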
>>
>> So about performance of this series..
>>
>> lockdep hooking into tracepoint code is a bit heavy, compared to
>> without this series. That's because of the design approach of
>> IRQ on/off -> Trace point -> lockdep
>>
>> Versus without this series which does
>> IRQ on/off -> lockdep
>>
>> So we lose performance because of that.
>>
>> This particular patch improves the situation, so it is probably
>> good to merge once we can test the performance of Masami's
>> suggestion as well.
>>
>> However, patch 4/4 which makes lockdep use the tracepoint causes a
>> performance hit of around 8% of mean time when I run:
>> hackbench -g 4 -f 2 -l 30000
>>
>> I narrowed the performance hit down to the call to
>> rcu_irq_enter_irqson() and rcu_irq_exit_irqson() in __DO_TRACE.
>> Commenting these 2 functions brings the perf level back.
>>
>> I was thinking about RCU usage here, and really we never change this
>> particular performance-sensitive tracepoint's function table 99.9% of
>> the time, so it seems there would be quite a win if we just had another
>> read-mostly synchronization mechanism that doesn't do all the RCU
>> tracking that's currently done here. Such a mechanism could also be
>> simpler.
>>
>> If I understand correctly, RCU also adds other complications such as
>> that it can't be used from the idle path, that's why the
>> rcu_irq_enter_* was added in the first place. Would be nice if we can
>> just avoid these RCU calls for the preempt/irq tracepoints... Any
>> thoughts about this or any other ideas to solve this?
>
> In theory, the tracepoint code could use SRCU instead of RCU, given that
> SRCU readers can be in the idle loop, although at the expense of a couple
> of smp_mb() calls in each tracepoint. In practice, I must defer to the
> people who know the tracepoint code better than I.

I've been wanting to introduce an alternative tracepoint instrumentation
"flavor", e.g. for system call entry/exit, which relies on SRCU rather than
sched-rcu (preempt-off). This would allow taking faults within the
instrumentation probe, which makes lots of things easier when fetching data
from user-space upon system call entry/exit. This could also be used to
cleanly instrument the idle loop.

I would be tempted to proceed carefully, though, and introduce a new kind
of SRCU tracepoint rather than changing all existing ones from sched-rcu
to SRCU.

So the lockdep stuff could use the SRCU tracepoint flavor, which I guess
would be faster than the rcu_irq_enter_*() calls.

Thanks,

Mathieu


>
> Thanx, Paul
>
>> Meanwhile I'll also do some performance testing with Masami's idea.
>>
>> thanks,
>>
>> - Joel

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com