Re: Nested events with zero deltas, can use absolute timestamps instead?

From: Jason Behmer
Date: Mon Apr 01 2019 - 18:54:39 EST


On Fri, Mar 29, 2019 at 9:53 AM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> On Fri, 29 Mar 2019 09:30:36 -0700
> Jason Behmer <jbehmer@xxxxxxxxxx> wrote:
>
> > Hi Steven and Tom,
> > I recently ran into a problem with correlating timestamps across CPUs and
> > finally was able to track it down to the fact that nested events are given
> > zero deltas, and so are placed earlier in time than they actually occurred.
> >
> > I'm on an older version and wanted to check if this problem still exists on
> > head and it appears that it does, but I also ran into Tom's commits adding
> > the ability to use absolute timestamps instead of deltas. It seems like
> > this option could fix the problem.
> >
> > It seems like ideally nested events could just always use these absolute
> > timestamps instead of zero deltas, as it appears the implementation of
> > absolute timestamps is flexible enough to have a mix of deltas and
> > absolutes in the stream of events. A fix that would require less work
> > would be for me to just turn on absolute timestamps for my entire trace,
> > but I see that the file to do that isn't yet writable to support that.
> >
> > Before trying to do either of these things I wanted to run it by both of
> > you to see if there's something I'm missing here. What do you think?
> >
>
> Yes this is an issue I need to fix. I have some ideas on how to handle
> this. I may need to go about looking into them again.
>
> It may be possible to add an option to have all nested events use the
> full timestamp as well.
>
> I'll have to think about this a bit more.
>
> -- Steve

The concurrency model is still a little bit unclear to me as I'm new
to this codebase. So I'm having some trouble reasoning about what
operations are safe at one point on the ring buffer. It seems like
we can't be preempted in general, just interrupts? And the events for
the events emitted by interrupts will be fully processed before
getting back to the event pointed at by the commit pointer? If this
is true I think the approach below (and prototyped in the attached
patch against head) might work and would love feedback. If not, this
problem is way harder.

We detect nested events by checking our event pointer against the
commit pointer. This is safe because we reserve our event space
atomically in the buffer, leading to an ordering of events we can
depend on. But to add a TIME_STAMP event we need to reserve more
space before we even have an event pointer, so we need to know
something about the ordering of events before we've actually
atomically reserved ours. We could check if the write pointer is set
to the commit pointer, and if it isn't we know we're a nested event.
But, someone could update the write pointer and/or commit pointer
between the time we check it and the time we atomically reserve our
space in the buffer. However, I think maybe this is ok.

If we see that the write pointer is not equal to the commit pointer,
then we're in an interrupt, and the only thing that could update the
commit pointer is the original event emitting code that was
interrupted, which can't run again until we're finished. And the only
thing that can update the write pointer is further interrupts of us,
which will advance the write pointer further away from commit, leaving
our decision to allocate a TIME_STAMP event as valid.

If we see that the write pointer is equal to the commit pointer, then
anything that interrupts us before we move the write pointer will see
that same state and will need to, before returning to us, commit their
event and set commit to their new write pointer, which will make our
decision valid once again.

There's a lot of assumptions in there that I'd love to be checked on
as I'm new to this code base. For example I haven't read the read
path at all and have no idea if it interacts with this at all.

Thanks,
Jason
From 4be4e74032336522691bf2e751e40ae679cc885b Mon Sep 17 00:00:00 2001
From: Jason Behmer <jbehmer@xxxxxxxxxx>
Date: Mon, 1 Apr 2019 15:15:13 -0700
Subject: [PATCH] Adding TIME_STAMP events to all nested events.

---
kernel/trace/ring_buffer.c | 96 +++++++++++++++++++++++++++-----------
1 file changed, 70 insertions(+), 26 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 41b6f96e5366..07a408ed7248 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -425,7 +425,8 @@ struct rb_event_info {
u64 delta;
unsigned long length;
struct buffer_page *tail_page;
- int add_timestamp;
+ bool add_time_extend;
+ bool add_time_stamp;
};

/*
@@ -2313,15 +2314,12 @@ rb_move_tail(struct ring_buffer_per_cpu *cpu_buffer,

/* Slow path, do not inline */
static noinline struct ring_buffer_event *
-rb_add_time_stamp(struct ring_buffer_event *event, u64 delta, bool abs)
+rb_add_time_extend(struct ring_buffer_event *event, u64 delta)
{
- if (abs)
- event->type_len = RINGBUF_TYPE_TIME_STAMP;
- else
- event->type_len = RINGBUF_TYPE_TIME_EXTEND;
+ event->type_len = RINGBUF_TYPE_TIME_EXTEND;

/* Not the first event on the page, or not delta? */
- if (abs || rb_event_index(event)) {
+ if (rb_event_index(event)) {
event->time_delta = delta & TS_MASK;
event->array[0] = delta >> TS_SHIFT;
} else {
@@ -2333,6 +2331,21 @@ rb_add_time_stamp(struct ring_buffer_event *event, u64 delta, bool abs)
return skip_time_extend(event);
}

+/* Slow path, do not inline */
+static noinline struct ring_buffer_event *
+rb_add_time_stamp(struct ring_buffer_event *event, u64 timestamp)
+{
+ event->type_len = RINGBUF_TYPE_TIME_STAMP;
+ /*
+ * Store a full timestamp in the time_delta field and the first data
+ * element.
+ */
+ event->time_delta = timestamp & TS_MASK;
+ event->array[0] = timestamp >> TS_SHIFT;
+
+ return skip_time_extend(event);
+}
+
static inline bool rb_event_is_commit(struct ring_buffer_per_cpu *cpu_buffer,
struct ring_buffer_event *event);

@@ -2360,15 +2373,17 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer,
delta = 0;

/*
- * If we need to add a timestamp, then we
+ * If we need to add a TIME_EXTEND or TIME_STAMP, then we
* add it to the start of the reserved space.
*/
- if (unlikely(info->add_timestamp)) {
- bool abs = ring_buffer_time_stamp_abs(cpu_buffer->buffer);
-
- event = rb_add_time_stamp(event, info->delta, abs);
+ if (unlikely(info->add_time_extend)) {
+ event = rb_add_time_extend(event, info->delta);
length -= RB_LEN_TIME_EXTEND;
delta = 0;
+ } else if (unlikely(info->add_time_stamp)) {
+ event = rb_add_time_stamp(event, info->delta);
+ length -= RB_LEN_TIME_STAMP;
+ delta = 0;
}

event->time_delta = delta;
@@ -2520,6 +2535,14 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
goto again;
}

+static __always_inline bool
+rb_commit_is_write(struct ring_buffer_per_cpu *cpu_buffer)
+{
+ return cpu_buffer->commit_page == cpu_buffer->tail_page &&
+ rb_commit_index(cpu_buffer) == rb_page_write(cpu_buffer->commit_page);
+
+}
+
static __always_inline void rb_end_commit(struct ring_buffer_per_cpu *cpu_buffer)
{
unsigned long commits;
@@ -2805,19 +2828,23 @@ EXPORT_SYMBOL_GPL(ring_buffer_unlock_commit);

static noinline void
rb_handle_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
- struct rb_event_info *info)
+ struct rb_event_info *info, int abs)
{
WARN_ONCE(info->delta > (1ULL << 59),
- KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",
+ KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu abs=%d\n%s",
(unsigned long long)info->delta,
(unsigned long long)info->ts,
(unsigned long long)cpu_buffer->write_stamp,
+ abs,
sched_clock_stable() ? "" :
"If you just came from a suspend/resume,\n"
"please switch to the trace global clock:\n"
" echo global > /sys/kernel/debug/tracing/trace_clock\n"
"or add trace_clock=global to the kernel command line\n");
- info->add_timestamp = 1;
+ if (abs)
+ info->add_time_stamp = true;
+ else
+ info->add_time_extend = true;
}

static struct ring_buffer_event *
@@ -2828,13 +2855,20 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
struct buffer_page *tail_page;
unsigned long tail, write;

- /*
- * If the time delta since the last event is too big to
- * hold in the time field of the event, then we append a
- * TIME EXTEND event ahead of the data event.
- */
- if (unlikely(info->add_timestamp))
+ if (unlikely(info->add_time_extend))
+ /*
+ * If the time delta since the last event is too big to
+ * hold in the time field of the event, then we append a
+ * TIME EXTEND event ahead of the data event.
+ */
info->length += RB_LEN_TIME_EXTEND;
+ else if (unlikely(info->add_time_stamp))
+ /*
+ * If using absolute time stamps for this event, then we append
+ * a TIME STAMP event ahead of the data event to fit the full
+ * timestamp.
+ */
+ info->length += RB_LEN_TIME_STAMP;

/* Don't let the compiler play games with cpu_buffer->tail_page */
tail_page = info->tail_page = READ_ONCE(cpu_buffer->tail_page);
@@ -2848,7 +2882,7 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
* If this is the first commit on the page, then it has the same
* timestamp as the page itself.
*/
- if (!tail && !ring_buffer_time_stamp_abs(cpu_buffer->buffer))
+ if (!tail && !unlikely(info->add_time_stamp))
info->delta = 0;

/* See if we shot pass the end of this buffer page */
@@ -2904,7 +2938,8 @@ rb_reserve_next_event(struct ring_buffer *buffer,

info.length = rb_calculate_event_length(length);
again:
- info.add_timestamp = 0;
+ info.add_time_extend = false;
+ info.add_time_stamp = false;
info.delta = 0;

/*
@@ -2927,19 +2962,28 @@ rb_reserve_next_event(struct ring_buffer *buffer,

if (ring_buffer_time_stamp_abs(buffer)) {
info.delta = info.ts;
- rb_handle_timestamp(cpu_buffer, &info);
+ rb_handle_timestamp(cpu_buffer, &info, true);
+ } else if (!rb_commit_is_write(cpu_buffer)) {
+ /*
+ * Use absolute timestamps for nested events, where we can't
+ * compute a delta.
+ */
+ info.delta = info.ts;
+ rb_handle_timestamp(cpu_buffer, &info, true);
} else /* Did the write stamp get updated already? */
if (likely(info.ts >= cpu_buffer->write_stamp)) {
info.delta = diff;
if (unlikely(test_time_stamp(info.delta)))
- rb_handle_timestamp(cpu_buffer, &info);
+ rb_handle_timestamp(cpu_buffer, &info, false);
}

event = __rb_reserve_next(cpu_buffer, &info);

if (unlikely(PTR_ERR(event) == -EAGAIN)) {
- if (info.add_timestamp)
+ if (info.add_time_extend)
info.length -= RB_LEN_TIME_EXTEND;
+ else if (info.add_time_stamp)
+ info.length -= RB_LEN_TIME_STAMP;
goto again;
}

--
2.21.0.392.gf8f6787159e-goog