Re: [PATCH v12 0/3] tracing: Centralize preemptirq tracepoints and unify their usage

From: Joel Fernandes
Date: Mon Aug 06 2018 - 11:24:24 EST


On Sun, Aug 5, 2018 at 7:07 PM, Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
> On Sun, 5 Aug 2018 09:46:56 -0700
> Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
>
>> Hi Masami,
>>
>> On Fri, Aug 3, 2018 at 9:51 PM, Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
>> [...]
>> >> On Thu, 2 Aug 2018 19:57:09 -0700
>> >> Joel Fernandes <joelaf@xxxxxxxxxx> wrote:
>> >>
>> >>> Hi Masami,
>> >>>
>> >>> On Thu, Aug 2, 2018 at 7:55 AM, Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>> >>> > Hi Joel,
>> >>> >
>> >>> > I found this caused several issues when testing ftrace.
>> >>> >
>> >>> > #1) ftrace boottest (FTRACE_STARTUP_TEST) fails
>> >>>
>> >>> This sadly appears to be a real issue. The startup test for
>> >>> "preemptirqsoff" tracer fails, however it passes for only preemptoff
>> >>> or only irqsoff. I tested only the last 2 tracers, not the first one,
>> >>> that's why I didn't catch it. I need to debug this more.
>>
>> I figured out this one too. Its because I need to account for
>> preempt_count() in tracer_hardirqs_off since the tracer probe is now
>> called with an additional level of preempt disabled from the
>> tracepoint code. Without that accounting, stop_critical_timings may
>> not be called causing an empty trace buffer. That should be easy to
>> fix, I'm on vacation though and back on 13th so can most likely look
>> at it only then (the week after the next).
>
> Nice! Thank you for identifying the root cause!

No problem :) Thanks a lot for your help!

>> >>> > #2) mmiotrace reports "IRQs not enabled as expected" error

I fixed #2 too now, the attached patch based on ftrace/core should fix
it (Sorry couldn't post the patch inline because I'm not at a sane
email client right now).

The patch is still in RFC shape, but I verified it fixes the issue.

Basically, I do this in the patch:
- switch to using SRCU for all tracepoints, use a separate SRCU node
for NMI context (Paul, Mathieu would love to get your thoughts here).
This is to remove dependence on sched-RCU.
- use tracepoints even after CPUs are offline

>> >>> > #3) lock subsystem event boottest causes "IRQs not disabled as expected" error (sometimes)
>> The only thing left to figure out is #3 ("lock subsystem event
>> boottest"). Could you let me know how to run this test?
>
> The #3 is not always reproducible, I just enabled these 2 configs
>
> CONFIG_FTRACE_STARTUP_TEST=y and CONFIG_EVENT_TRACE_TEST_SYSCALLS=y

Thanks, I am also able to see this now some of the times. I have
certain builds where this is much faster to reproduce so I'll use
those.

- Joel
From 6986af946ceb04fc9ddc6d5b45fc559b6807e465 Mon Sep 17 00:00:00 2001
From: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>
Date: Sun, 5 Aug 2018 20:17:41 -0700
Subject: [PATCH] tracepoint: Run tracepoints even after CPU is offline

Commit f37755490fe9 ("tracepoints: Do not trace when cpu is offline")
causes a problem for lockdep using tracepoint code. Once a CPU is
offline, tracepoints donot get called, however this causes a big problem
for lockdep probes that need to run so that IRQ annotations are marked
correctly.

A race is possible where while the CPU is going offline, an interrupt
can come in and then a lockdep assert causes an annotation warning:

[ 106.551354] IRQs not enabled as expected
[ 106.551785] WARNING: CPU: 1 PID: 0 at kernel/time/tick-sched.c:982
tick_nohz_idle_enter+0x99/0xb0
[ 106.552964] Modules linked in:
[ 106.553299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W

We need tracepoints to run as late as possible. This commit fixes the
issue by removing the cpu_online check in tracepoint code that was
introduced in the mentioned commit, however we now switch to using SRCU
for all tracepoints and special handle calling tracepoints from NMI so
that we don't run into issues that result from using sched-RCU when the
CPUs are marked to be offline.

Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and
unify their usage")
Reported-by: Masami Hiramatsu <mhiramat@xxxxxxxxxx>
Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
---
include/linux/tracepoint.h | 27 +++++++++++----------------
kernel/tracepoint.c | 10 ++++++----
2 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index d9a084c72541..5733502bb3ce 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -35,6 +35,7 @@ struct trace_eval_map {
#define TRACEPOINT_DEFAULT_PRIO 10

extern struct srcu_struct tracepoint_srcu;
+extern struct srcu_struct tracepoint_srcu_nmi;

extern int
tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
@@ -144,13 +145,11 @@ extern void syscall_unregfunc(void);
void *it_func; \
void *__data; \
int __maybe_unused idx = 0; \
+ struct srcu_struct *ss; \
\
if (!(cond)) \
return; \
\
- /* srcu can't be used from NMI */ \
- WARN_ON_ONCE(rcuidle && in_nmi()); \
- \
/* keep srcu and sched-rcu usage consistent */ \
preempt_disable_notrace(); \
\
@@ -159,7 +158,11 @@ extern void syscall_unregfunc(void);
* doesn't work from the idle path. \
*/ \
if (rcuidle) \
- idx = srcu_read_lock_notrace(&tracepoint_srcu); \
+ ss = &tracepoint_srcu_nmi; \
+ else \
+ ss = &tracepoint_srcu; \
+ \
+ idx = srcu_read_lock_notrace(ss); \
\
it_func_ptr = rcu_dereference_raw((tp)->funcs); \
\
@@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
} while ((++it_func_ptr)->func); \
} \
\
- if (rcuidle) \
- srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
+ srcu_read_unlock_notrace(ss, idx); \
\
preempt_enable_notrace(); \
} while (0)
@@ -212,11 +214,6 @@ extern void syscall_unregfunc(void);
TP_PROTO(data_proto), \
TP_ARGS(data_args), \
TP_CONDITION(cond), 0); \
- if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
- rcu_read_lock_sched_notrace(); \
- rcu_dereference_sched(__tracepoint_##name.funcs);\
- rcu_read_unlock_sched_notrace(); \
- } \
} \
__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \
PARAMS(cond), PARAMS(data_proto), PARAMS(data_args)) \
@@ -365,19 +362,17 @@ extern void syscall_unregfunc(void);
* "void *__data, proto" as the callback prototype.
*/
#define DECLARE_TRACE_NOARGS(name) \
- __DECLARE_TRACE(name, void, , \
- cpu_online(raw_smp_processor_id()), \
+ __DECLARE_TRACE(name, void, , 1, \
void *__data, __data)

#define DECLARE_TRACE(name, proto, args) \
- __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
- cpu_online(raw_smp_processor_id()), \
+ __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), 1, \
PARAMS(void *__data, proto), \
PARAMS(__data, args))

#define DECLARE_TRACE_CONDITION(name, proto, args, cond) \
__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \
- cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \
+ PARAMS(cond), \
PARAMS(void *__data, proto), \
PARAMS(__data, args))

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 955148d91b74..769d74b2f90e 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -32,7 +32,9 @@ extern struct tracepoint * const __start___tracepoints_ptrs[];
extern struct tracepoint * const __stop___tracepoints_ptrs[];

DEFINE_SRCU(tracepoint_srcu);
+DEFINE_SRCU(tracepoint_srcu_nmi);
EXPORT_SYMBOL_GPL(tracepoint_srcu);
+EXPORT_SYMBOL_GPL(tracepoint_srcu_nmi);

/* Set to 1 to enable tracepoint debug output */
static const int tracepoint_debug;
@@ -70,14 +72,14 @@ static inline void *allocate_probes(int count)
return p == NULL ? NULL : p->probes;
}

-static void srcu_free_old_probes(struct rcu_head *head)
+static void srcu_free_old_probes_nmi(struct rcu_head *head)
{
kfree(container_of(head, struct tp_probes, rcu));
}

-static void rcu_free_old_probes(struct rcu_head *head)
+static void srcu_free_old_probes(struct rcu_head *head)
{
- call_srcu(&tracepoint_srcu, head, srcu_free_old_probes);
+ call_srcu(&tracepoint_srcu_nmi, head, srcu_free_old_probes_nmi);
}

static inline void release_probes(struct tracepoint_func *old)
@@ -91,7 +93,7 @@ static inline void release_probes(struct tracepoint_func *old)
* cover both cases. So let us chain the SRCU and sched RCU
* callbacks to wait for both grace periods.
*/
- call_rcu_sched(&tp_probes->rcu, rcu_free_old_probes);
+ call_srcu(&tracepoint_srcu, &tp_probes->rcu, srcu_free_old_probes);
}
}

--
2.18.0.597.ga71716f1ad-goog