Re: [RFC][PATCH] ring-buffer: Have nested events still record running time stamp
From: Steven Rostedt
Date: Thu Jun 25 2020 - 14:35:31 EST
On Thu, 25 Jun 2020 13:55:07 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> > Here's the design of this solution:
> >
> > All this is per cpu, and only needs to worry about nested events (not
> > parallel events).
> >
> > The players:
> >
> > write_tail: The index in the buffer where new events can be written to.
> > It is incremented via local_add() to reserve space for a new event.
> >
> > before_stamp: A time stamp set by all events before reserving space.
> >
> > write_stamp: A time stamp updated by events after it has successfully
> > reserved space.
>
> What are the types used for before_stamp and write_stamp ? If those
> are 64-bit integers, how does sharing them between nested contexts
> work on 32-bit architectures ?
Well, write_stamp is updated via the local64 API, which I believe
handles this for us. I should probably make before_stamp use it as well.
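Something like this, just as a sketch of the idea (not the actual patch):

    /* local64_t maps to atomic64_t on 32-bit architectures, so
     * 64-bit loads and stores of the stamp cannot be torn. */
    local64_t write_stamp;

    u64 after = local64_read(&write_stamp);
    local64_set(&write_stamp, ts);

before_stamp is currently a plain value accessed with
READ_ONCE()/WRITE_ONCE(), which does not give that guarantee for a
64-bit value on 32-bit.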
>
> >
> > next_write: A copy of "write_tail" used to help with races.
> >
> > /* Save the current position of write */
> > [A] w = local_read(write_tail);
> > barrier();
> > /* Read both before and write stamps before touching anything */
> > before = READ_ONCE(before_stamp);
> > after = local_read(write_stamp);
> > barrier();
> >
> > /*
> > * If before and after are the same, then this event is not
> > * preempting a time update. If it is, then reserve space for adding
>
> You should use the term "interrupt" rather than "preempting", because
> as you stated yourself, this algorithm only works with nested
> interrupts, not with preemption.
The two terms are basically interchangeable here, but for consistency,
I will update it. Thanks.
>
> > * a full time stamp (this can turn into a time extend which is
> > * just an extended time delta but fill up the extra space).
> > */
> > if (after != before)
> > abs = true;
> >
> > ts = clock();
> >
> > /* Now update the before_stamp (everyone does this!) */
> > [B] WRITE_ONCE(before_stamp, ts);
> >
> > /* Read the current next_write and update it to what we want
> > * write to be after we reserve space. */
> > next = READ_ONCE(next_write);
> > WRITE_ONCE(next_write, w + len);
> >
> > /* Now reserve space on the buffer */
> > [C] write = local_add_return(len, write_tail);
>
> So the reservation is not "just" an add instruction, it's actually an
> xadd on x86. Is that really faster than a cmpxchg ?
I believe the answer is still yes, but I can run some benchmarks to
make sure.
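For reference, the two schemes look roughly like this (simplified
sketch that ignores the end-of-page handling):

    /* xadd: one atomic instruction, never has to retry */
    write = local_add_return(len, &write_tail);
    tail = write - len;

    /* cmpxchg: must loop whenever a nested event sneaks in */
    do {
        tail = local_read(&write_tail);
    } while (local_cmpxchg(&write_tail, tail, tail + len) != tail);

The xadd always succeeds in one shot; a nested event moves the tail
forward, but it never forces the interrupted event to retry.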
>
> >
> > /* Set tail to be where this event's data is */
> > tail = write - len;
> >
> > if (w == tail) {
> >
> > /* Nothing preempted this between A and C */
> > [D] local_set(write_stamp, ts);
> > barrier();
> > [E] save_before = READ_ONCE(before_stamp);
> >
> > if (!abs) {
> > /* This did not preempt a time update */
> > delta = ts - a;
>
> What does "a" refer to ? What triggers its update ?
Oops. When I first wrote this, I used "a" and "b" for "after" and
"before", and had "after" and "before" be "after_stamp" and
"before_stamp". I missed this update. Nice catch.
>
> > } else {
> > /* slow path */
> > if (w == next) {
>
> If it's a slow path, why try to play games with a delta timestamp ?
> Space should not be an issue for a slow path like this. It would be
> simpler to just use the full timestamp here.
Hmm, you may be right. Previous iterations of this code had a real
distinction here, but after restructuring it, I don't think that
distinction is valid anymore. If anything, keeping it at least lets me
validate that it's doing what I believe it should be doing (I added a
ton of trace_printk()s into this while testing).
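If I do end up dropping the delta handling there, that branch would
collapse to something like this (sketch only; the full timestamp gets
stored as a RINGBUF_TYPE_TIME_STAMP event rather than as a delta):

    if (abs) {
        /* Some event nested in here, so a delta cannot be
         * trusted; store the full clock value instead. */
        delta = ts;
    }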
>
> > /* This event preempted the previous event
> > * just after it reserved its buffer.
>
> You mean nesting after [C] but before [D].
Yes. I can add that for clarity, but perhaps I won't need it if I
merge the two.
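That is, something along these lines (my sketch of the intent only):

    if (w == next) {
        /* We nested in right after the interrupted event's [C]:
         * it reserved its space but never reached [D], so
         * write_stamp ("after") is stale. The before_stamp it
         * wrote at [B] is that event's timestamp, and is what
         * this event's timestamp gets based on. */
        ...
    }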
>
> > * It's timestamp should be "before_stamp" */
>
> It's -> Its
;-)
My email client messed up your formatting of the rest of the email, so
I'll send a separate reply.
-- Steve