Re: LTTng 0.146, adds extra read-side sub-buffer for flightrecorder

From: Mathieu Desnoyers
Date: Tue Jul 14 2009 - 18:57:40 EST


* Steven Rostedt (rostedt@xxxxxxxxxxx) wrote:
>
>
> On Mon, 13 Jul 2009, Mathieu Desnoyers wrote:
>
> > Hi,
> >
> > So, I needed a weekend break from writing my thesis (It's almost over!) ;)
> > and I had the great idea to try to come up with a way to ensure that
> > LTTng flight recorder mode permits to have a read-side that never sees
> > corrupted data.
> >
> > Basically, this is the main thing Steven have been asking me for a
> > while. And it looks like I just figured out a way to do it.
> >
> > So for flight recorder tracing, this new LTTng version allocates an
> > extra subbuffer which gets exchanged by the reader with the writer
> > subbuffer before it gets read.
> >
> > Normal tracing does not need this extra subbuffer, because the
> > write-side just drops events when the buffer is full. So we don't
> > allocate it and we don't perform any exchange. The space
> > reservation/commit code plays nicely with both flight recorder and
> > normal tracing schemes.
> >
> > Here is how I did it:
> >
> > No modification was required to the buffer space reservation/commit
> > algorithm. I just had to do the following at the backend level
> > (responsible for writing data to/reader data from the buffer):
> >
> > I am using an array of pointers (one pointer for each subbuffer), plus a
> > pointer to the reader subbuffer. Each of these pointers are pointing to
> > an array of pages, which are all the pages that constitute a subbuffer.
> > Reads/writes from/to the buffer are done by accessors which pick up the
> > right page location within this page table. By modifying the top-level
> > subbuffer pointer, we can swap a whole subbuffer in a single operation.
> >
> > There is a trick to deal with concurrency between writer and reader.
> > When the top-level subbuffer pointers are not used (no writer is
> > currently writing into it, no reader is reading from its subbuffer), we
> > set a RCHAN_NOREF_FLAG (value: 0x1) which indicates that no reference is
> > currently taken to this subbuffer. As long as this flag is set in the
> > pointer, it is safe for the reader to exchange it. When the writer needs
> > to access this subbuffer for writing, it clears the flag, and sets it
> > back after committing the last piece of data to it.
> >
> > When the reader figures out that the write-side subbuffer it is trying
> > to exchange has a reference, it fails with -EAGAIN.
> >
> > Nice things about the way I do it here:
> >
> > - I keep the separation between the space reservation layer and back-end
> > buffer layer. The extra reader subbuffer exchange is done at the
> > back-end layer. The reason why it took me so long to try to come up
> > with something is that I tried to do it at the space reservation
> > layer, which was not fitting well the space reservation semantics.
> >
> > - Keeping space reservation and physical buffer management separate
> > helps splitting complexity into sub-layers easier to verify.
> >
> > - Given the space reservation/commit is separate from the subbuffer
> > exchange per se, I don't need any special-cases for "if the tail
> > pointer is in the reader page".... these things never happen because
> > the reserve, commit and consumed counts are completely unrelated to
> > the pointers to physical subbuffers.
>
> I don't yet have time to read the patches (not this week, anyway), but I'm
> assuming that you can only get the new page (swap) while a writer is not
> writing to it. Thus if it is not a full page, then you must either copy
> the data, or swap out a non full page. Not complaining here, just trying
> to understand it :-)

Yep, it's a requirement that when a subbuffer is being written to, it's
not possible for the reader to exchange it, so it's impossible to read
it. It simplifies a lot of things.

>
> Thus the trick is that you have a series of pointers to the data, and you
> swap out the data and not the list? Hmm, actually the ring buffer is
> already like that and I probably could do the same.

There is no list involved per se. I exchange the pointers to these
structures, not the data itself.

Let's say I have 2 sub-buffers for the writer and one extra sub-buffer
for the reader. I would have:

- An array of 2 pointers to sub-buffer structures. (owned by the writer)
- 1 pointer to subbuffer structure (owned by the reader).

>
> Here's another thing that the ring buffer does (and makes things a little
> complex too) is that it keeps track of the number of entries in the buffer
> as well as the number of overruns. The number of entries in the page is
> kept in the list data and not the data page itself.

I could add these counters to the "sub-buffer structure". They are not
part of the data itself. When I exchange the top-level pointers to these
structures, the counters, which are part of these structures, will
follow.

>
> Using a special flag to switch out the data instead of breaking the link
> list may make things much simpler.

Yeah :) That's why I've chosen to use such flag. And using a linked list
seems like overly complex compared to the simple 2-levels page table I
use here.

>
> Hmm, I'll take a round to make the ring buffer closer to what you have
> done. At this rate, we may finally merge the two to handle things that we
> both need ;-)

Hopefully :) Please don't hesitate for more info if you need some.

I'll have to go back to my thesis next week, but hopefully within 2 more
weeks I should be almost done and more available.

Mathieu

>
> -- Steve
>
>
> >
> > As always, the tree is available at:
> >
> > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git
> > git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git
> >
> > The commits implementing this the extra reader page for the lockless
> > scheme are:
> >
> > lttng-relay-per-subbuffer-index.patch
> > lttng-relay-per-subbuffer-index-low-bit-noref.patch
> > lttng-relay-lockless-writer-use-noref-flag.patch
> > lttng-relay-default-sb-index-to-noref.patch
> > lttng-relay-lockless-exchange-reader-writer-pages.patch
> >
> > Comments are welcome,
> >
> > Thanks,
> >
> > Mathieu
> >
> > --
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> >

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/