[PATCH 2/2] More precise timestamps for nested writes

From: Suresh E. Warrier
Date: Mon Apr 13 2015 - 22:38:20 EST


When tracing the behavior of multiple fio jobs running in parallel,
our performance team observed that some scsi_dispatch_cmd_done events
appeared to occur earlier, often several microseconds earlier, than
their associated scsi_dispatch_cmd_start event in the trace records.
Other interrupt events were also observed to have earlier timestamps
than the events that caused them.

This incorrect chronological ordering of trace records occurs because
ALL nested writes get the same time stamp as the first writer, that is,
the writer in the stack that was preempted first. In workloads where
interrupts occur frequently, the first writer can stay preempted across
multiple interrupts, and nested trace events can record time stamps that
are many microseconds earlier than their actual values, resulting in
the wrong ordering.

For example, a scsi_dispatch_cmd_start on CPU A occurs, say at time
t1, and the corresponding scsi_dispatch_cmd_done occurs on another
CPU B, say at time t2. However, CPU B is in the middle of a nested
write with the first writer having recorded its event at time t0, where
t0 < t1 < t2. Eventually, we get out of all the nested writes and
the first writer commits its trace event with a time of t0, and the
nested scsi_dispatch_cmd_done gets assigned the same timestamp t0.
In the trace report, the scsi_dispatch_cmd_done thus appears to have
occurred before the scsi_dispatch_cmd_start.

In some cases, on larger systems with multiple fio jobs running and
interrupts being sent to a single CPU, we have noticed more than
400 trace events all with the same timestamp.

The reason all nested writes have the same time stamp is that all
nested trace events are assigned a time delta of 0. A correct time
delta cannot be computed for them because the first (interrupted) write
has not been committed yet, so the commit timestamp in the CPU buffer
does not record the time stamp of the immediately preceding event.
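
As an illustration (simplified code, not the real ring-buffer reader;
sample_event is a made-up stand-in), absolute event times are rebuilt
by accumulating per-event deltas, so a delta of 0 pins an event to its
predecessor's timestamp:

	/* Illustrative only: not the actual ring-buffer structures. */
	struct sample_event { u32 time_delta; };

	static u64 next_timestamp(u64 prev_ts, const struct sample_event *ev)
	{
		return prev_ts + ev->time_delta; /* 0 => same stamp as prev */
	}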

One way of fixing this is to keep a timestamp of the last event we
reserved space for in the ring buffer, so that we can compute the
correct time delta for each event. We also need a short critical
section where we cannot be interrupted, in which we (sketched below):
1. Read the current time
2. Compute the time delta for this event from the last time stamp
3. Update the value of the last time stamp
4. Reserve space in the ring buffer for the event
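
In outline, the critical section looks roughly like the fragment below
(schematic only; the patch itself folds this into rb_reserve_next_event()
through the get_write_timestamp()/put_write_timestamp() helpers, and
updates last_stamp in put_write_timestamp() after the reserve):

	unsigned long flags;
	u64 ts, delta;

	local_irq_save(flags);			/* keep steps 1-4 on this CPU */
	ts = rb_time_stamp(cpu_buffer->buffer);	/* 1. read the current time */
	delta = ts - cpu_buffer->last_stamp;	/* 2. delta from the last event */
	cpu_buffer->last_stamp = ts;		/* 3. update the last time stamp */
	event = __rb_reserve_next(cpu_buffer, length, ts,
				  delta, add_timestamp); /* 4. reserve space */
	local_irq_restore(flags);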

Although the ring buffer design is carefully coded to avoid disabling
interrupts, in this case there does not appear to be a practical way
to solve this problem without disabling interrupts for a short time.
To accommodate those architectures where disabling interrupts is
expensive, a kernel tunable (/sys/kernel/debug/tracing/nested_precise_ts)
needs to be set to 1 to enable this behavior and get accurate
timestamps.
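
With the patch applied the precise-timestamp path stays off by default;
assuming debugfs is mounted in the usual place, it would be enabled with
something like:

	# echo 1 > /sys/kernel/debug/tracing/nested_precise_ts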

Signed-off-by: Suresh Warrier <warrier@xxxxxxxxxxxxxxxxxx>
---
kernel/trace/ring_buffer.c | 71 +++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 67 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c9b3005..0a2d862 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -482,6 +482,7 @@ struct ring_buffer_per_cpu {
unsigned long read_bytes;
u64 write_stamp;
u64 read_stamp;
+ u64 last_stamp;
/* ring buffer pages to update, > 0 to add, < 0 to remove */
int nr_pages_to_update;
struct list_head new_pages; /* new pages to add */
@@ -2010,7 +2011,8 @@ rb_update_event(struct ring_buffer_per_cpu *cpu_buffer,
{
/* Only a commit updates the timestamp */
if (unlikely(!rb_event_is_commit(cpu_buffer, event)))
- delta = 0;
+ if (!rb_precise_nested_write_ts())
+ delta = 0;

/*
* If we need to add a timestamp, then we
@@ -2534,6 +2536,60 @@ void rb_disable_precise_nested_write_ts(void)
static_key_slow_dec(&__precise_nested_write_ts);
}

+/**
+ * get_write_timestamp:
+ * Must be called before we read the current time stamp
+ * It returns a pointer to the location of the last
+ * time stamp to be used in the delta calculation.
+ * If precise nested write timestamps are enabled, it first
+ * disables interrupts on the current processor so that
+ * we can reserve space on the buffer and save the event's
+ * timestamp without being preempted.
+ *
+ * put_write_timestamp:
+ * Must be called after we reserve space in the ring buffer.
+ * If precise nested write timestamps are enabled, it saves the
+ * timestamp in the specified timestamp location (passed in
+ * as *pstamp) so that the nested writers always have a valid
+ * timestamp to compute the timestamp deltas for their events.
+ * This must be done before we re-enable interrupts.
+ *
+ */
+static u64 *get_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
+ unsigned long *flags)
+{
+ if (rb_precise_nested_write_ts()) {
+ /*
+ * Ensure that we are not preempted until after we update
+ * the write timestamp.
+ */
+ local_irq_save(*flags);
+ return &cpu_buffer->last_stamp;
+ } else {
+ return &cpu_buffer->write_stamp;
+ }
+}
+
+static void put_write_timestamp(struct ring_buffer_per_cpu *cpu_buffer,
+ u64 *pstamp,
+ u64 ts,
+ unsigned long *flags)
+{
+ /*
+ * Rather than make the rb_precise_nested_write_ts()
+ * call (this would be a NOP for the default case) to check
+ * if we need to restore interrupts, we need to explicitly
+ * test if pstamp points to the last_stamp field below to
+ * handle the case where the nested_precise_ts sysfs tunable
+ * was updated between the get_write_timestamp() and
+ * put_write_timestamp() routines.
+ */
+ if (&cpu_buffer->last_stamp == pstamp) {
+ *pstamp = ts;
+ local_irq_restore(*flags);
+ }
+}
+
static struct ring_buffer_event *
rb_reserve_next_event(struct ring_buffer *buffer,
struct ring_buffer_per_cpu *cpu_buffer,
@@ -2544,6 +2600,8 @@ rb_reserve_next_event(struct ring_buffer *buffer,
int nr_loops = 0;
int add_timestamp;
u64 diff;
+ u64 *pstamp;
+ unsigned long flags = 0; /* GCC */

rb_start_commit(cpu_buffer);

@@ -2579,14 +2637,16 @@ rb_reserve_next_event(struct ring_buffer *buffer,
if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000))
goto out_fail;

+ pstamp = get_write_timestamp(cpu_buffer, &flags);
+
ts = rb_time_stamp(cpu_buffer->buffer);
- diff = ts - cpu_buffer->write_stamp;
+ diff = ts - *pstamp;

/* make sure this diff is calculated here */
barrier();

/* Did the write stamp get updated already? */
- if (likely(ts >= cpu_buffer->write_stamp)) {
+ if (likely(ts >= *pstamp)) {
delta = diff;
if (unlikely(test_time_stamp(delta))) {
int local_clock_stable = 1;
@@ -2597,7 +2657,7 @@ rb_reserve_next_event(struct ring_buffer *buffer,
KERN_WARNING "Delta way too big! %llu ts=%llu write stamp = %llu\n%s",
(unsigned long long)delta,
(unsigned long long)ts,
- (unsigned long long)cpu_buffer->write_stamp,
+ (unsigned long long)*pstamp,
local_clock_stable ? "" :
"If you just came from a suspend/resume,\n"
"please switch to the trace global clock:\n"
@@ -2608,6 +2668,9 @@ rb_reserve_next_event(struct ring_buffer *buffer,

event = __rb_reserve_next(cpu_buffer, length, ts,
delta, add_timestamp);
+
+ put_write_timestamp(cpu_buffer, pstamp, ts, &flags);
+
if (unlikely(PTR_ERR(event) == -EAGAIN))
goto again;

--
1.8.3.4
