[RFC 3/3] preempt_tracer: Use a percpu variable to control traceble calls

From: Daniel Bristot de Oliveira
Date: Tue May 28 2019 - 11:20:31 EST


The preempt_disable tracepoint only traces in the disable <-> enable case.
Which is correct, but think about this case:

--------------------------- %< ----------------------
THREAD IRQ
| |
preempt_disable_notrace() {
__preempt_count_add(1)
}

/* preemption disabled/IRQs enabled */

-------> smp_apic_timer_interrupt() {
preempt_disable() {
do not trace (preempt count >= 1)
}
....
preempt_enable() {
do not trace (preempt count >= 1)
}
<------- }
preempt_enable_notrace() {
__preempt_count_sub(1)
}
--------------------------- >% ----------------------

The non-traceble preempt_disable can hide a legit preempt_disable (which
is worth tracing).

It is possible to observe this problem using this kernel module:

http://bristot.me/files/efficient_verification/wip.tar.gz

and doing the following trace:

# cd /sys/kernel/debug/tracing/
# echo 1 > snapshot
# cat available_events | grep preempt_ > set_event
# echo irq_vectors >> /sys/kernel/debug/tracing/set_event
# insmod wip.ko
/* wait for a snapshot creation */
# tail -100 snapshot
sshd-1159 [000] d...1.. 2440.866116: preempt_enable: caller=migrate_enable+0x1bb/0x330 parent=migrate_enable+0x1bb/0x330
sshd-1159 [000] d..h1.. 2440.866122: local_timer_entry: vector=236
sshd-1159 [000] d..h1.. 2440.866127: local_timer_exit: vector=236
sshd-1159 [000] d.L.4.. 2440.866129: process_event: event sched_waking not expected in the state preemptive
sshd-1159 [000] d.L.4.. 2440.866137: <stack trace>
=> process_event
=> __handle_event
=> ttwu_do_wakeup
=> try_to_wake_up
=> irq_exit
=> smp_apic_timer_interrupt
=> apic_timer_interrupt
=> kvm_clock_read
=> ktime_get_with_offset
=> posix_get_boottime
=> __x64_sys_clock_gettime
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

and kvm_clock_read() disables preemption without tracing:

--------------------------- %< ----------------------
static u64 kvm_clock_read(void)
{
u64 ret;

preempt_disable_notrace();
ret = pvclock_clocksource_read(this_cpu_pvti());
preempt_enable_notrace();
return ret;
}
--------------------------- >% ----------------------

Use a percpu variable for the traced preempt_disable/enable, and use it
to decide whether trace or not.

Signed-off-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
Cc: "Steven Rostedt (VMware)" <rostedt@xxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Matthias Kaehlcke <mka@xxxxxxxxxxxx>
Cc: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>
Cc: Frederic Weisbecker <frederic@xxxxxxxxxx>
Cc: Yangtao Li <tiny.windzz@xxxxxxxxx>
Cc: Tommaso Cucinotta <tommaso.cucinotta@xxxxxxxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
---
kernel/sched/core.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be4117d7384f..2e07d4174778 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3148,19 +3148,25 @@ static inline void sched_tick_stop(int cpu) { }

#if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
defined(CONFIG_TRACE_PREEMPT_TOGGLE))
+
+DEFINE_PER_CPU(int, __traced_preempt_count) = 0;
/*
* If the value passed in is equal to the current preempt count
* then we just disabled preemption. Start timing the latency.
*/
void preempt_latency_start(int val)
{
- if (preempt_count() == val) {
+ int curr = this_cpu_read(__traced_preempt_count);
+
+ if (!curr) {
unsigned long ip = get_lock_parent_ip();
#ifdef CONFIG_DEBUG_PREEMPT
current->preempt_disable_ip = ip;
#endif
trace_preempt_off(CALLER_ADDR0, ip);
}
+
+ this_cpu_write(__traced_preempt_count, curr + val);
}

static inline void preempt_add_start_latency(int val)
@@ -3200,8 +3206,12 @@ NOKPROBE_SYMBOL(preempt_count_add);
*/
void preempt_latency_stop(int val)
{
- if (preempt_count() == val)
+ int curr = this_cpu_read(__traced_preempt_count) - val;
+
+ if (!curr)
trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+
+ this_cpu_write(__traced_preempt_count, curr);
}

static inline void preempt_sub_stop_latency(int val)
--
2.20.1