Re: [PATCH v1] perf doc: Document ring buffer mechanism

From: Leo Yan
Date: Wed Aug 02 2023 - 23:24:19 EST


On Mon, Jul 24, 2023 at 01:46:07PM -0700, Ian Rogers wrote:

[...]

> Picking up from here.
>
> > > +The mechanism of AUX ring buffer
> > > +--------------------------------
> > > +
> > > +In this chapter, we will explain the implementation of the AUX ring
> > > +buffer. In the first part it will discuss the connection between the
> > > +AUX ring buffer and the generic ring buffer, then the second part will
> > > +examine how the AUX ring buffer co-works with the generic ring buffer,
> > > +as well as the additional features introduced by the AUX ring buffer for
> > > +the sampling mechanism.
> > > +
> > > +The relationship between AUX and generic ring buffers
> > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > +
> > > +Generally, the AUX ring buffer is an auxiliary for the generic ring
> > > +buffer. The generic ring buffer is primarily used to store the event
> > > +samples, and every event fromat complies with the definition in the
>
> nit: s/fromat/format/
>
> > > +union perf_event; the AUX ring buffer is for recording the hardware
> > > +trace data, and the trace data format is hardware IP dependent. The
> > > +advantage of the AUX ring buffer is that it can de-couple the data
> > > +transferring between the generic perf events and the hardware tracing.
>
> I'm wondering if the wording of the last sentence can be made a little
> easier. Perhaps:
> The general use and advantage of the AUX ring buffer is that it
> written to directly by hardware rather than by the kernel. For
> example, regular profile samples that write to the generic ring buffer
> cause an interrupt. Tracing execution requires a high number of
> samples and using interrupts would be overwhelming for the generic
> ring buffer mechanism. Having an aux buffer allows for a region of
> memory more decoupled from the kernel and written to directly by
> hardware tracing.

Thanks for helping to rephrase. It is clear and reads well to me; I will
add it in the next spin.

> > > +The AUX ring buffer reuses the same algorithm with the generic ring
> > > +buffer for the buffer management. The control structure
> > > +perf_event_mmap_page extends the new fields aux_head and aux_tail for
> > > +the head and tail pointers of the AUX ring buffer.
> > > +
> > > +During the AUX trace initialisation, record_opts::auxtrace_mmap_pages
> > > +is set for the AUX buffer size in page unit, otherwise, this option is
> > > +the default value '0' which means a perf session is not attached to any
> > > +AUX trace.
>
> This jumps into a bunch of perf tool details that it would be nice to
> discuss in more abstract terms. I think an important thing to mention
> is that unlike the mmap of the regular perf ring buffer, the aux mmap
> needs a second syscall.

Agreed. Combining it with the following paragraph, I rephrased it as:

During the initialisation phase, besides the mmap()-ed regular ring
buffer, the perf tool invokes a second syscall in the
auxtrace_mmap__mmap() function to mmap the AUX buffer; rb_alloc_aux()
in the kernel allocates the pages, and mapping these pages into the VMA
is deferred until the page fault is handled, which is the same lazy
mechanism as for the regular ring buffer.
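
For illustration only (not part of the patch), the second mmap() could
be sketched in user space as below. The helper names aux_area_offset()
and map_aux_area() are hypothetical, and placing the AUX area directly
after the control page plus the data pages is an assumption of this
sketch; only the aux_offset and aux_size fields of struct
perf_event_mmap_page and the mmap() call itself come from the real
interface.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: byte offset of the AUX area. This sketch assumes
 * it is placed right after the control page and the data pages of the
 * regular ring buffer.
 */
uint64_t aux_area_offset(size_t data_pages, size_t page_size)
{
	return (uint64_t)(1 + data_pages) * page_size;
}

/*
 * Hypothetical helper: map the AUX buffer for an already opened perf
 * event fd. 'pc' is the control page obtained from the first mmap().
 * Returns MAP_FAILED on error.
 */
void *map_aux_area(int fd, struct perf_event_mmap_page *pc,
		   size_t data_pages, size_t aux_pages, size_t page_size)
{
	/* The AUX size is expected to be a power-of-two number of pages. */
	assert(aux_pages != 0 && (aux_pages & (aux_pages - 1)) == 0);

	/* Publish where the AUX area lives and how big it is. */
	pc->aux_offset = aux_area_offset(data_pages, page_size);
	pc->aux_size = (uint64_t)aux_pages * page_size;

	/* Second mmap() syscall, separate from the regular ring buffer. */
	return mmap(NULL, pc->aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, (off_t)pc->aux_offset);
}
```

Error handling and the first mmap() of the regular ring buffer are left
out to keep the sketch short.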

> > > +
> > > +When record_opts::auxtrace_mmap_pages is a non-zero value, the
> > > +auxtrace_mmap__mmap() function invokes rb_alloc_aux() in the kernel for
> > > +allocating kernel pages; these pages will be deferred to map into VMA
> > > +when handling the page fault, which is the same lazy mechanism with the
> > > +generic ring buffer.
> > > +
> > > +The AUX event and AUX trace data are two different things. Likewise the
> > > +PMU events, the AUX event will be saved into the generic ring buffer
> > > +while the AUX trace data is stored in the AUX ring buffer. As a result,
> > > +the generic ring buffer and the AUX ring buffer are allocated in pairs,
> > > +even if only one hardware trace event is enabled.
>
> nit: s/The AUX event and/AUX events and/
>
> Would the hardware trace event be the aux event? Perhaps an example
> would be useful here.

Good point. I refined it as:

AUX events and AUX trace data are two different things. Let's see an
example:

perf record -a -e cycles -e cs_etm/@tmc_etr0/ -- sleep 2

The above command enables two events: one is the event 'cycles' from the
PMU and the other is the AUX event 'cs_etm' from Arm CoreSight. Both
events are saved into the regular ring buffer, while CoreSight's trace
data is stored in the AUX ring buffer. As a result, the regular ring
buffer and the AUX ring buffer are allocated in pairs.

> > > +
> > > +Now let's see the AUX ring buffer deployment in the perf modes. For
> > > +per-thread mode, perf tool allocates only one generic ring buffer and one
> > > +AUX ring buffer for the whole session; for the system wide mode, perf
> > > +allocates the generic ring buffer and the AUX ring buffer per CPU wise.
>
> Perhaps mention per-thread with a CPU, as perf won't use per-thread
> mode without a command line argument.

Yeah, to make it clearer, I refined it as:

Now, let's see the AUX ring buffer deployment in the perf modes. In the
default mode, the perf tool allocates the regular ring buffer and the
AUX ring buffer per CPU, which is the same as the system wide mode; the
difference is that the default mode records samples only for the
profiled program, whereas the system wide mode profiles all programs in
the system. For per-thread mode, the perf tool allocates only one
regular ring buffer and one AUX ring buffer for the whole session. For
per-CPU mode, the perf tool allocates both kinds of ring buffers for
the CPUs specified by the option '-C'.
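
As a rough summary of the paragraph above, the number of (regular, AUX)
buffer pairs per mode can be written down as a small function. The enum
and the function name are hypothetical and exist only for this
illustration; they are not perf tool code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, used only for this illustration. */
enum record_mode {
	MODE_DEFAULT,		/* perf record -- <cmd> */
	MODE_SYSTEM_WIDE,	/* perf record -a */
	MODE_PER_THREAD,	/* perf record --per-thread */
	MODE_PER_CPU,		/* perf record -C <cpu list> */
};

/* Number of (regular ring buffer, AUX ring buffer) pairs per mode. */
size_t nr_buffer_pairs(enum record_mode mode, size_t nr_online_cpus,
		       size_t nr_selected_cpus)
{
	switch (mode) {
	case MODE_DEFAULT:
	case MODE_SYSTEM_WIDE:
		/* One pair per CPU; only the set of profiled tasks differs. */
		return nr_online_cpus;
	case MODE_PER_THREAD:
		/* One pair for the whole session. */
		return 1;
	case MODE_PER_CPU:
		/* One pair for each CPU given with '-C'. */
		assert(nr_selected_cpus <= nr_online_cpus);
		return nr_selected_cpus;
	}
	return 0;
}
```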

[...]

> > > +Once the hardware trace data is stored into AUX ring buffer, the
> > > +function perf_aux_output_end() finishes two things:
> > > +
> > > +- It fills an event PERF_RECORD_AUX into the generic ring buffer, this
> > > +event delivers the information of the start address and data size for a
> > > +chunk of hardware trace data has been stored into the AUX ring buffer;
> > > +
> > > +- Since the hardware trace driver has stored new trace data into the AUX
> > > +ring buffer, the argument 'size' indicates how many bytes have been
> > > +consumed by the hardware tracing, thus perf_aux_output_end() updates the
> > > +header pointer perf_buffer::aux_head to reflect the latest buffer usage.
> > > +
>
> Perhaps add a description of lost events?

Good point. I tweaked the above sentences as:

"Once the hardware trace data is stored into the AUX ring buffer, the PMU
driver will stop hardware tracing by calling the pmu::stop() callback.
Similar to the regular ring buffer, the AUX ring buffer needs to apply
the memory synchronization mechanism as discussed in the section "Memory
synchronization". Since the AUX ring buffer is managed by the PMU
driver, the barrier (B), which is a write barrier to ensure the trace
data is externally visible prior to updating the head pointer, must be
implemented in the PMU driver.

Then pmu::stop() can safely call the perf_aux_output_end() function to
finish two things:

...

At the end, the PMU driver will restart hardware tracing. While tracing
is temporarily suspended, hardware trace data is lost, which
introduces a discontinuity in the decoding phase."
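
For illustration only (not part of the patch), the consumer side of the
flow above could be sketched as follows: given the offset and size
delivered by a PERF_RECORD_AUX event, copy the chunk out of the AUX
buffer and compute the position to publish as the new tail. The struct
and function names are hypothetical local mirrors rather than the
kernel's definitions, and the sketch assumes the buffer size is a power
of two so that offsets can be masked. Memory barriers around the head
and tail accesses are deliberately omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirror of the PERF_RECORD_AUX payload fields. */
struct aux_record {
	uint64_t aux_offset;	/* where the new chunk starts */
	uint64_t aux_size;	/* size of the chunk in bytes */
	uint64_t flags;		/* not interpreted by this sketch */
};

/*
 * Copy one chunk out of the AUX buffer into 'dst', handling the case
 * where the chunk wraps around the end of the buffer. Returns the
 * position just past the chunk, which a consumer could publish as the
 * new tail once it has finished reading.
 */
uint64_t aux_copy_chunk(const uint8_t *aux_buf, uint64_t buf_size,
			const struct aux_record *rec, uint8_t *dst)
{
	uint64_t start, first;

	/* Assumption of this sketch: power-of-two buffer size. */
	assert(buf_size != 0 && (buf_size & (buf_size - 1)) == 0);
	assert(rec->aux_size <= buf_size);

	start = rec->aux_offset & (buf_size - 1);
	first = buf_size - start;
	if (first > rec->aux_size)
		first = rec->aux_size;

	/* Part up to the end of the buffer, then the wrapped remainder. */
	memcpy(dst, aux_buf + start, first);
	memcpy(dst + first, aux_buf, rec->aux_size - first);

	return rec->aux_offset + rec->aux_size;
}
```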

[...]

> Thanks again for all of this, sorry for the delay in the 2nd part of my review.

I very much appreciate your detailed review and the many suggestions,
which helped to improve this doc a lot!

Leo