Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp

From: Steven Rostedt
Date: Fri Jun 26 2020 - 14:58:27 EST


On Fri, 26 Jun 2020 14:13:50 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:

> >
> > static inline bool __rb_time_read(rb_time_t *t, u64 *ret, unsigned long *cnt)
> > {
> > unsigned long top, bottom;
> >
> > top = local_read(&t->top);
> > bottom = local_read(&t->bottom);
> >
> > *cnt = rb_time_cnt(top);
> >
> > if (*cnt != rb_time_cnt(bottom))
> > return false;
> >
> > *ret = rb_time_val(top, bottom);
> > return true;
> > }

BTW, I've decided to include this in my final patch series which I'm
finishing up now.

>
> If __rb_time_read or rb_time_cmpxchg are used in an interrupt over
> rb_time_set (between setting top and bottom), those will never succeed.
>
> How is this case handled ?
>

The three places using cmpxchg:

In the discard code (which is only called when nesting already happened
as it uses per_cpu buffers first, as they are faster), thus this
requires a double nest:


if (bpage->page == (void *)addr && rb_page_write(bpage) == old_index) {
unsigned long write_mask =
local_read(&bpage->write) & ~RB_WRITE_MASK;
unsigned long event_length = rb_event_length(event);

/* Something came in, can't discard */
if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
write_stamp, write_stamp - delta))
return 0;


It fails to do the discard and fills the buffer with "padding".

Second usage:

/* SLOW PATH - Interrupted between A and C */
a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
ts = rb_time_stamp(cpu_buffer->buffer);
barrier();
/*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
a_ok && after < ts) {
/* Nothing came after this event between C and E */
info->delta = ts - after;
(void)rb_time_cmpxchg(&cpu_buffer->write_stamp, after, info->ts);
info->ts = ts;
} else {
info->delta = 0;

Not only did this event interrupt something, but something interrupt
this event. Punt and just use the delta zero.

Third usage:

if (unlikely(info->ts != save_before)) {
/* SLOW PATH - Interrupted between C and E */

a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
RB_WARN_ON(cpu_buffer, !a_ok);

/* Write stamp must only go forward */
if (save_before > after) {
/*
* We do not care about the result, only that
* it gets updated atomically.
*/
(void)rb_time_cmpxchg(&cpu_buffer->write_stamp, after, save_before);
}
}

It's where we want to update the write stamp to the latest, and which
it already wrote to the write_stamp. Notice there's even a warning that
will cause the ring buffer to crash (fail) if it's not OK. Which I
triggered when I forgot to handle the case where reading could get
corrupted by being interrupted, which I fixed soon after ;-)

As for the save_before. This event writes to the before after reading
it, so the before_stamp had better be ok (and there's a warning there
too).

Here's the read locations:

In the discard as well. If it can't read it, it will just fill with padding.

if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp))
return 0;


On reserving the event:

barrier();
b_ok = rb_time_read(&cpu_buffer->before_stamp, &before);
a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
barrier();

[..]
if (unlikely(!a_ok || !b_ok || (before != after && w)))
info->add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_EXTEND;

It treats these failed reads as interrupting the event and uses the
full time stamp. Thanks to your suggestion about combining the the
case where I went back to the delta, I don't need these values anymore.

/* See if we shot pass the end of this buffer page */
if (unlikely(write > BUF_PAGE_SIZE)) {
if (tail != w) {
/* before and after may now different, fix it up*/
b_ok = rb_time_read(&cpu_buffer->before_stamp, &before);
a_ok = rb_time_read(&cpu_buffer->write_stamp, &after);
if (a_ok && b_ok && before != after)
(void)rb_time_cmpxchg(&cpu_buffer->before_stamp, before, after);
}
return rb_move_tail(cpu_buffer, tail, info);
}

This is when we overrun the sub buffer. This is more of a clean up.
because I found that because we always write to before_stamp, and then
go back to the page, it triggered it as being different. The only
reason we want them to be the same is to prevent a unnecessary time
stamp at the beginning of the page, which already stores a full time
stamp.

/*E*/ s_ok = rb_time_read(&cpu_buffer->before_stamp, &save_before);
RB_WARN_ON(cpu_buffer, !s_ok);

As mentioned, this before_stamp was written to by this event, and is
now used to know if it was interrupted since.

I think I hit all the locations, and that is how it handles not being
able to read it. The important thing is, it must be able to read it if
it was written by itself or an event that interrupted it. And it must
be able to know if it is invalid, and that can only happen if the event
interrupted the writing of it.

Cheers!

-- Steve