[PATCH] trace: sched_switch event variant with raw monotonic timestamp

From: Pawel Moll
Date: Fri Aug 22 2014 - 10:23:13 EST


This patch adds a variant of the sched_switch trace event with an
additional field carrying a raw monotonic clock timestamp.

Signed-off-by: Pawel Moll <pawel.moll@xxxxxxx>
---
include/trace/events/sched.h | 48 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 1 +
2 files changed, 49 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0a68d5a..1429958 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/tracepoint.h>
#include <linux/binfmts.h>
+#include <linux/time.h>

/*
* Tracepoint for calling kthread_stop, performed to end a kthread:
@@ -150,6 +151,53 @@ TRACE_EVENT(sched_switch,
);

/*
+ * Tracepoint for task switches, including a raw monotonic clock
+ * timestamp, to perform user- to kernel-space time correlation
+ */
+TRACE_EVENT(sched_switch_raw_monotonic,
+
+ TP_PROTO(struct task_struct *prev,
+ struct task_struct *next),
+
+ TP_ARGS(prev, next),
+
+ TP_STRUCT__entry(
+ __field( u64, raw_monotonic )
+ __array( char, prev_comm, TASK_COMM_LEN )
+ __field( pid_t, prev_pid )
+ __field( int, prev_prio )
+ __field( long, prev_state )
+ __array( char, next_comm, TASK_COMM_LEN )
+ __field( pid_t, next_pid )
+ __field( int, next_prio )
+ ),
+
+ TP_fast_assign(
+ struct timespec now;
+ getrawmonotonic(&now);
+ __entry->raw_monotonic = timespec_to_ns(&now);
+ memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+ __entry->prev_pid = prev->pid;
+ __entry->prev_prio = prev->prio;
+ __entry->prev_state = __trace_sched_switch_state(prev);
+ memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+ __entry->next_pid = next->pid;
+ __entry->next_prio = next->prio;
+ ),
+
+ TP_printk("raw_monotonic=%Lu prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+ __entry->raw_monotonic, __entry->prev_comm,
+ __entry->prev_pid, __entry->prev_prio,
+ __entry->prev_state & (TASK_STATE_MAX-1) ?
+ __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
+ { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+ { 16, "Z" }, { 32, "X" }, { 64, "x" },
+ { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
+ __entry->prev_state & TASK_STATE_MAX ? "+" : "",
+ __entry->next_comm, __entry->next_pid, __entry->next_prio)
+);
+
+/*
* Tracepoint for a task being migrated:
*/
TRACE_EVENT(sched_migrate_task,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec1a286..0cc7a92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2179,6 +2179,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
trace_sched_switch(prev, next);
+ trace_sched_switch_raw_monotonic(prev, next);
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
--
1.9.1
8<-------------

I then ran an experiment: I opened a perf session with this event and
executed the following test:

clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_1);
write(marker, "test", 4);
clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_2);

where "marker" is a descriptor for
"/sys/kernel/debug/tracing/trace_marker". In other words, I generated a
unique event which will be timestamped by sched_clock. Then I averaged
t_m_1 and t_m_2 to get t_m_x, translated it into t_s_x with the formula
above and compared it with a timestamp for the marked, grepped from the
"/sys/kernel/debug/tracing/trace" (as it is directly injected into the
ftrace buffer, there is no way of getting it into perf data stream). On
average I was getting results accurate to or below 1 microsecond (that's
the resolution of a timestamp generated by standard ftrace output).
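
For reference, fleshed out into a standalone program the test looks more
or less like below. This is just an illustrative sketch, not the exact
code I used; error handling is minimal and the t_m_x -> t_s_x translation
is left out, as the correlation formula is quoted earlier in the thread:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static uint64_t ts_to_ns(const struct timespec *ts)
{
	return (uint64_t)ts->tv_sec * 1000000000ULL + ts->tv_nsec;
}

int main(void)
{
	struct timespec t_m_1, t_m_2;
	uint64_t t_m_x;
	int marker;

	marker = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
	if (marker < 0) {
		perror("trace_marker");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_1);
	if (write(marker, "test", 4) != 4)
		perror("write");
	clock_gettime(CLOCK_MONOTONIC_RAW, &t_m_2);
	close(marker);

	/* The write() happened somewhere between the two readings,
	 * so take the midpoint as its raw monotonic timestamp. */
	t_m_x = (ts_to_ns(&t_m_1) + ts_to_ns(&t_m_2)) / 2;

	/* t_m_x would then be translated into sched_clock time (t_s_x)
	 * and compared with the marker's timestamp in the trace. */
	printf("t_m_x = %llu ns\n", (unsigned long long)t_m_x);

	return 0;
}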

So, the approach works in general, however:

1. I guess adding a raw monotonic field to the "vanilla" sched_switch
event won't be praised... and a clone like the one above doesn't make
much sense either, I think.

2. Even if this were possible (or we had a clone like the above), only
root can request trace events (see the sketch after this list). This is
acceptable for system-wide profiling tools (which by definition run as
root), but is a problem for JIT engines, which usually run as
unprivileged processes.

3. The accuracy, although not bad and sufficient for a lot of use cases
(e.g. latencies when accessing hwmon sensors are large and unpredictable
by their nature, so a microsecond here and there doesn't matter at all),
may not be good enough for JIT engines - there must be a way of precisely
marking the moment between new code (and its debug information) being
generated and its first instruction being executed. "trace_marker" could
be used for this, but as mentioned above, it is invisible to perf, and
even if it weren't, it would be root-only, see point 2.
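
To illustrate point 2, this is roughly what an unprivileged process hits
when it tries to request a tracepoint event through perf_event_open()
(again just a sketch, not code from the patch; the event id path and
error handling are only illustrative, and reading the id from debugfs
normally needs root as well):

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	long long id = 0;
	FILE *f;
	int fd;

	/* Tracepoint id as exported by ftrace. */
	f = fopen("/sys/kernel/debug/tracing/events/sched/sched_switch/id",
		  "r");
	if (!f || fscanf(f, "%lld", &id) != 1) {
		perror("tracepoint id");
		return 1;
	}
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_TRACEPOINT;
	attr.size = sizeof(attr);
	attr.config = id;
	attr.sample_period = 1;
	attr.sample_type = PERF_SAMPLE_TIME | PERF_SAMPLE_RAW;

	/* Sample the event for the current task on any CPU. */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		perror("perf_event_open"); /* EACCES/EPERM without root */
	else
		close(fd);

	return 0;
}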

With all this in mind, I thought it would be possible to generalise the
"monotonic timestamping" at the perf interface level and to duplicate the
marker idea there... A short RFC series to follow.

Pawel
