Re: [PATCH v1] perf doc: Document ring buffer mechanism

From: Leo Yan
Date: Wed Aug 02 2023 - 13:00:51 EST


Hi Ian,

On Mon, Jul 17, 2023 at 02:27:11PM -0700, Ian Rogers wrote:

[...]

> > Thanks Ian's suggestion for upstreaming this documentation into Linux
> > source code tree at the Jan of this year, also thanks my son Rowan for
> > correcting me some grammer errors in this summer holiday. Finally I
> > heavily refactored this documentation and I think it's ready for
> > reviewing.
>
> This is awesome, thanks Leo for taking the time to do this! I've done
> some readability nits

Thanks a lot for review, Ian.

I agreed with most of your comments. Blow I just keep the comments for
further discussion or give supplements, otherwise, I remove the rest
comments and will reflect them in my new patch.

[...]

> > +Perf uses the same way to manage its ring buffer. In implementation
> > +there have two actors: a page contains a control structure and ring
> > +buffer; the page containing the control structure is named as "user
> > +page", the user page and the ring buffer are mapped into user space
> > +with the continuous virtual address, since the user page and the ring
> > +buffer are consecutive, from the control structure pointer it's
> > +easily to know the ring buffer address.
>
> nit: Perhaps reword a little as:
> Perf uses the same way to manage its ring buffer. In the
> implementation there are two key data structures held together in a
> set of consecutive pages, the control structure and then the ring
> buffer itself. The page with the control structure in is known as the
> "user page". Being held in continuous virtual addresses simplifies
> locating the ring buffer address, it is in the pages after the page
> with the user page.
>
> Off-topic: seems wasteful to allocate a full page for this.

Arm CPUs support not only 4KiB page size, but also support 16KiB/64KiB
page size, it does waste a bit memory for using a page for control
structure. But given the control structure need to be mapped to user
space in page size unit, seems here have no room to optimize it.

[...]

> > + user page ring buffer
> > + +---------+---------+ +---------------------------------------+
> > + |data_head|data_tail|...| | |***|***|***|***|***| | | |
> > + +---------+---------+ +---------------------------------------+
> > + ` `--------------^ ^
> > + `--------------------------------------------|
> > +
> > + * : the data is filled by the writer.
> > + Figure 2: Perf ring buffer
> > +
> > +When using 'perf record' tool, we can specify the ring buffer size with
>
> nit: s/using/using the/
>
> > +option '-m' or '--mmap-pages=', finally the size will be rounded up to
>
> nit: s/finally the size/the given size/
>
> > +power of two with page unit. Though the kernel allocates at once for
>
> nit: s/power of two with page unit/a power of two that is a multiple
> of a page size/
>
> Off-topic-ish: Perhaps it is worth motivating why the page size must
> be a power of 2. I'm guessing it is because this means that when
> moving the pointers/indices they can be masked to cause wrapping,
> rather than using division/modulus.

Correct. The comments in kernel/events/core.c verifies this guessing:

"If we have rb pages ensure they're a power-of-two number, so we
can do bitmasks instead of modulo."

> Fwiw, I think this could also be
> solved with a comparison, and the performance overhead could be
> insignificant compared to the memory savings.

I am not sure if a comparison is sufficient. As you said, if the page
number is not a power-of-two number, and page index increases
monotonically, we need to compute modulus to get the offset.

Agreed with you, we can consider to remove the limitaion of the page
number must be power-of-two for memory saving and check what's the
performance penalty.

> > +all memory pages, it's deferred to map the pages to VMA area until
> > +the perf tool accesses the buffer from the user space. In other words,
> > +at the first time accesses the buffer's page from user space in the perf
> > +tool, a data abort exception for page fault is taken and the kernel
> > +uses this occasion to map the page into process VMA, thus the perf tool
> > +can continue to access the page after returning from exception.
>
> nit: s/exception/the exception/
>
> Off topic: Should the perf tool use something like the MAP_POPULATE
> flag to reduce page faults, given the pages were allocated already in
> the kernel? Tweaking tools/lib/perf/mmap.c to do this and running
> "time perf record -a sleep 1" shows the minor page fault count going
> from 7700 to 9700, so it seems like a bad idea.

Hmm ... I got the opposite result, by observing page fault counting, I
can see the improvement with adding the MAP_POPULATE flag.

Before adding MAP_POPULATE flag:

# ./perf stat -- ./perf record -a sleep 1

5,359 page-faults # 2.409 K/sec
5,353 page-faults # 2.415 K/sec
5,353 page-faults # 2.425 K/sec

# ./perf stat -- ./perf record -e cs_etm/@tmc_etr0/ -a sleep 1

2,122 page-faults # 2.038 K/sec
2,121 page-faults # 2.001 K/sec
2,121 page-faults # 2.015 K/sec

After adding MAP_POPULATE flag:

# ./perf stat -- ./perf record -a sleep 1

5,004 page-faults # 2.260 K/sec
5,002 page-faults # 2.253 K/sec
5,003 page-faults # 2.239 K/sec

# ./perf stat -- ./perf record -e cs_etm/@tmc_etr0/ -a sleep 1

1,082 page-faults # 856.589 /sec
1,082 page-faults # 1.002 K/sec
1,080 page-faults # 1.013 K/sec

[...]

> > +System wide mode
> > +
> > +By default if without specifying mode, or explicitly using option '–a'
> > +or '––all–cpus', perf collects samples on all CPUs in the system wide
> > +mode.
> > +
> > +In this mode, every CPU has its own ring buffer; all threads are
> > +monitored during the running state and the samples are recorded into the
>
> This doesn't cover the case of:
> $ perf record benchmark
> Where there will be a ring buffer on every CPU but events/sampling
> will only be enabled for benchmark's threads, ie not all threads.

Correct. I wrongly thought it's the system-wide mode when don't
specify option '-a', will add a 'default mode' to address above case.

[...]

> > When a sample is recorded into the ring buffer, the kernel event
> > +core layer will wake up the perf process to read samples from the ring
> > +buffer.
>
> It isn't always necessary to wake the perf tool process. There is a
> little bit more detail on different polling modes in the
> perf_event_open man page in the section on "Overflow handling":
> https://github.com/mkerrisk/man-pages/blob/master/man2/perf_event_open.2#L3062

Exactly, perf ring buffer uses watermark as threshold, and only when
cross the threshold the kernel notifies the user space. I confirmed
this in the function __perf_output_begin().

So will rephrase as:

"When a sample is recorded into the ring buffer, and the number of
samples crossing the threshold, the kernel event core layer will wake up
the perf process to read samples from the ring buffer."

> > +
> > + Perf
> > + / | Read samples
> > + Polling / `--------------| Ring buffer
> > + v v ;-------------------v
> > + +----------------+ +---------+---------+ +-------------------+
> > + |Event wait queue| |data_head|data_tail| |***|***| | |***|
> > + +----------------+ +---------+---------+ +-------------------+
> > + ^ ^ `----------------------^
> > + | Wake up tasks | Store samples
> > + +-----------------------------+
> > + | Kernel event core layer |
> > + +-----------------------------+
> > +
> > + * : the data is filled by the writer.
> > + Figure 6: Writing and reading the ring buffer
> > +
> > +Because multiple events might share the same ring buffer for recording
> > +samples, when any event sample is stored into the ring buffer, the
> > +kernel event core layer iterates every event associated to the ring
> > +buffer and wake up tasks on the wait queue of the events. This is
> > +fulfilled by the kernel function ring_buffer_wakeup().
>
> I'm not sure about the use of "event" here. If you do:
> $ perf stat -e "cycles,instructions"
> Then cycles and instructions will each have a ring buffer, I'm not
> sure it is possible to get them to share a ring buffer. I think here
> you may be referring to side band events, like mmap2.

Seems to me, this is incorrect. Since 'perf stat' only read counters
(via the kernel function perf_read()), it doesn't allocate ring buffer
at all for events. By using GDB, I can confirm the function
perf_mmap__mmap() is never called for 'perf stat' command.

Just clarify, 'perf stat' still mmap the 'user page' for control
structure and for timer counter accessing, but this is not the same
thing with ring buffer.

[...]

> > +In Linux kernel, the event core layer uses the structure perf_buffer to
> > +track the buffer’s latest header, and it keeps the information for
> > +buffer pages. The structure perf_buffer manages ring buffer during its
> > +life cycle, it is allocated once the ring buffer is created and released
> > +when the ring buffer is destroyed.
> > +
> > +It’s possible for multiple events to write buffer concurrently. For
> > +instance, a software event and a hardware PMU event both are enabled for
> > +profiling, when the software event is in the middle of sampling, the
> > +hardware event maybe be overflow and its interrupt is triggered in this
> > +case. This leads to the race condition for updating perf_buffer::head.
> > +In __perf_output_begin(), Linux kernel uses compare-and-swap atomicity
> > +local_cmpxchg() to implement the lockless algorithm for protecting
> > +perf_buffer::head.
>
> In the case of:
> $ perf record -e "instructions,faults"
> faults is a software event but it won't share the ring buffer with
> instructions. I think the atomicity exists for the side band events.

I verified on my Arm64 board and confirmed that hardware events
(instructions, cycles) share the same ring buffer with software event
(faults). Below is the captured log (truncated for more readable):

...
=> __perf_sw_event
=> do_page_fault
=> do_translation_fault
...
ls-884 [003] ...2. 253.737217: perf_output_begin_forward: __perf_output_begin: rb=00000000720bac97 page=0 offset=1688 addr=0000000019392296 size=2408

...
=> __perf_event_overflow
=> perf_event_overflow
=> armv8pmu_handle_irq
...
ls-884 [003] d.h1. 253.737247: perf_output_begin_forward: __perf_output_begin: rb=00000000720bac97 page=0 offset=1736 addr=000000009e259b5a size=2360

We can see for both software event and Armv8 PMU event, both use the
same rb pointer (0x00000000720bac9), which means the software and
hardware events store samples into the same ring buffer. Thus the
above description is valid.

P.s. maybe the side band events are not a relevant topic, but I can
see the side band events have dedicated ring buffer.

Thanks,
Leo