Re: [PATCH v1] perf doc: Document ring buffer mechanism

From: Ian Rogers
Date: Wed Aug 02 2023 - 18:02:21 EST


On Wed, Aug 2, 2023 at 10:00 AM Leo Yan <leo.yan@xxxxxxxxxx> wrote:
>
> Hi Ian,
>
> On Mon, Jul 17, 2023 at 02:27:11PM -0700, Ian Rogers wrote:
>
> [...]
>
> > > Thanks to Ian's suggestion for upstreaming this documentation into
> > > the Linux source code tree in January of this year, and thanks to
> > > my son Rowan for correcting some grammar errors for me over the
> > > summer holiday. Finally, I heavily refactored this documentation
> > > and I think it's ready for review.
> >
> > This is awesome, thanks Leo for taking the time to do this! I've done
> > some readability nits
>
> Thanks a lot for review, Ian.
>
> I agree with most of your comments. Below I keep only the comments
> that need further discussion or a supplement; I have removed the rest
> and will reflect them in my new patch.
>
> [...]
>
> > > +Perf uses the same way to manage its ring buffer. In implementation
> > > +there have two actors: a page contains a control structure and ring
> > > +buffer; the page containing the control structure is named as "user
> > > +page", the user page and the ring buffer are mapped into user space
> > > +with the continuous virtual address, since the user page and the ring
> > > +buffer are consecutive, from the control structure pointer it's
> > > +easily to know the ring buffer address.
> >
> > nit: Perhaps reword a little as:
> > Perf uses the same way to manage its ring buffer. In the
> > implementation there are two key data structures held together in a
> > set of consecutive pages: the control structure and then the ring
> > buffer itself. The page containing the control structure is known as
> > the "user page". Being held at continuous virtual addresses
> > simplifies locating the ring buffer address; it is in the pages
> > after the user page.
> >
> > Off-topic: seems wasteful to allocate a full page for this.
>
> Arm CPUs support not only a 4KiB page size but also 16KiB/64KiB page
> sizes, so it does waste a bit of memory to use a whole page for the
> control structure. But given that the control structure needs to be
> mapped to user space in page-size units, there seems to be no room to
> optimize this.
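>
> To make the layout concrete, here is a minimal user-space sketch
> (error handling elided; mmap_ring_buffer() is just an illustrative
> helper, not a real API): the user page and the ring buffer are mapped
> with a single mmap() call on the perf event fd, so the ring buffer
> address is trivially derived from the control structure pointer:
>
>     #include <linux/perf_event.h>
>     #include <sys/mman.h>
>     #include <unistd.h>
>
>     /* Map 1 user page + 2^n data pages in one contiguous VMA. */
>     static void *mmap_ring_buffer(int perf_fd, unsigned int nr_data_pages)
>     {
>             size_t page_size = sysconf(_SC_PAGESIZE);
>             size_t len = (1 + nr_data_pages) * page_size;
>             struct perf_event_mmap_page *user_page;
>
>             user_page = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                              MAP_SHARED, perf_fd, 0);
>             if (user_page == MAP_FAILED)
>                     return NULL;
>
>             /* The data pages start right after the user page. */
>             return (char *)user_page + page_size;
>     }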
>
> [...]
>
> > > + user page ring buffer
> > > + +---------+---------+ +---------------------------------------+
> > > + |data_head|data_tail|...| | |***|***|***|***|***| | | |
> > > + +---------+---------+ +---------------------------------------+
> > > + ` `--------------^ ^
> > > + `--------------------------------------------|
> > > +
> > > + * : the data is filled by the writer.
> > > + Figure 2: Perf ring buffer
> > > +
> > > +When using 'perf record' tool, we can specify the ring buffer size with
> >
> > nit: s/using/using the/
> >
> > > +option '-m' or '--mmap-pages=', finally the size will be rounded up to
> >
> > nit: s/finally the size/the given size/
> >
> > > +power of two with page unit. Though the kernel allocates at once for
> >
> > nit: s/power of two with page unit/a power of two that is a multiple
> > of a page size/
> >
> > Off-topic-ish: Perhaps it is worth motivating why the page size must
> > be a power of 2. I'm guessing it is because this means that when
> > moving the pointers/indices they can be masked to cause wrapping,
> > rather than using division/modulus.
>
> Correct. The comment in kernel/events/core.c confirms this guess:
>
> "If we have rb pages ensure they're a power-of-two number, so we
> can do bitmasks instead of modulo."
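>
> For example (a sketch, not the kernel's exact code), with a
> power-of-two buffer size the wrap-around on a monotonically
> increasing head becomes a single bitwise AND:
>
>     /* size = nr_pages * PAGE_SIZE, with nr_pages a power of two. */
>     unsigned long mask = size - 1;
>     unsigned long offset = head & mask;      /* cheap mask */
>     /* versus, for an arbitrary size: head % size (a division) */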
>
> > Fwiw, I think this could also be
> > solved with a comparison, and the performance overhead could be
> > insignificant compared to the memory savings.
>
> I am not sure a comparison is sufficient. As you said, if the number
> of pages is not a power of two and the page index increases
> monotonically, we need to compute a modulus to get the offset.
>
> Agreed, we can consider removing the limitation that the number of
> pages must be a power of two, to save memory, and then check what the
> performance penalty is.
>
> > > +all memory pages, it's deferred to map the pages to VMA area until
> > > +the perf tool accesses the buffer from the user space. In other words,
> > > +at the first time accesses the buffer's page from user space in the perf
> > > +tool, a data abort exception for page fault is taken and the kernel
> > > +uses this occasion to map the page into process VMA, thus the perf tool
> > > +can continue to access the page after returning from exception.
> >
> > nit: s/exception/the exception/
> >
> > Off topic: Should the perf tool use something like the MAP_POPULATE
> > flag to reduce page faults, given the pages were already allocated
> > in the kernel? Tweaking tools/lib/perf/mmap.c to do this and running
> > "time perf record -a sleep 1" shows the minor page fault count going
> > from 7700 to 9700, so it seems like a bad idea.
>
> Hmm ... I got the opposite result: observing the page fault counts, I
> can see an improvement after adding the MAP_POPULATE flag.
>
> Before adding the MAP_POPULATE flag:
>
> # ./perf stat -- ./perf record -a sleep 1
>
> 5,359 page-faults # 2.409 K/sec
> 5,353 page-faults # 2.415 K/sec
> 5,353 page-faults # 2.425 K/sec
>
> # ./perf stat -- ./perf record -e cs_etm/@tmc_etr0/ -a sleep 1
>
> 2,122 page-faults # 2.038 K/sec
> 2,121 page-faults # 2.001 K/sec
> 2,121 page-faults # 2.015 K/sec
>
> After adding the MAP_POPULATE flag:
>
> # ./perf stat -- ./perf record -a sleep 1
>
> 5,004 page-faults # 2.260 K/sec
> 5,002 page-faults # 2.253 K/sec
> 5,003 page-faults # 2.239 K/sec
>
> # ./perf stat -- ./perf record -e cs_etm/@tmc_etr0/ -a sleep 1
>
> 1,082 page-faults # 856.589 /sec
> 1,082 page-faults # 1.002 K/sec
> 1,080 page-faults # 1.013 K/sec
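>
> For reference, the tweak I measured just adds MAP_POPULATE to the
> mmap() call in perf_mmap__mmap() in tools/lib/perf/mmap.c, along
> these lines (a sketch, not the exact call site):
>
>     /* MAP_POPULATE pre-faults the pages at mmap() time, trading
>      * mmap() latency for fewer minor faults on first access. */
>     void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
>                       MAP_SHARED | MAP_POPULATE, perf_fd, 0);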
>
> [...]

Interesting, I wonder if it is an ARM vs x86 difference?

> > > +System wide mode
> > > +
> > > +By default if without specifying mode, or explicitly using option '-a'
> > > +or '--all-cpus', perf collects samples on all CPUs in the system wide
> > > +mode.
> > > +
> > > +In this mode, every CPU has its own ring buffer; all threads are
> > > +monitored during the running state and the samples are recorded into the
> >
> > This doesn't cover the case of:
> > $ perf record benchmark
> > Where there will be a ring buffer on every CPU but events/sampling
> > will only be enabled for benchmark's threads, ie not all threads.
>
> Correct. I wrongly thought it's the system-wide mode when option '-a'
> is not specified; I will add a 'default mode' to address the above
> case.
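>
> For the record, the difference between the two modes boils down to
> the pid/cpu arguments passed to perf_event_open() (a rough sketch,
> not the perf tool's exact code; workload_pid is the forked workload):
>
>     /* System-wide mode ('-a'): one event per CPU, for any pid. */
>     fd = syscall(SYS_perf_event_open, &attr, /*pid=*/-1, cpu, -1, 0);
>
>     /* Default mode: one event per CPU, but only counting the
>      * workload's threads. */
>     fd = syscall(SYS_perf_event_open, &attr, /*pid=*/workload_pid,
>                  cpu, -1, 0);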
>
> [...]
>
> > > When a sample is recorded into the ring buffer, the kernel event
> > > +core layer will wake up the perf process to read samples from the ring
> > > +buffer.
> >
> > It isn't always necessary to wake the perf tool process. There is a
> > little bit more detail on different polling modes in the
> > perf_event_open man page in the section on "Overflow handling":
> > https://github.com/mkerrisk/man-pages/blob/master/man2/perf_event_open.2#L3062
>
> Exactly, the perf ring buffer uses a watermark as the threshold, and
> the kernel only notifies user space when the threshold is crossed. I
> confirmed this in the function __perf_output_begin().
>
> So will rephrase as:
>
> "When a sample is recorded into the ring buffer and the number of
> samples crosses the threshold, the kernel event core layer will wake
> up the perf process to read samples from the ring buffer."
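>
> The relevant check in __perf_output_begin() is roughly the following
> (paraphrased from kernel/events/ring_buffer.c, details elided):
>
>     /* Only arm a wakeup when the write position crosses the next
>      * watermark, instead of on every sample. */
>     if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
>             local_add(rb->watermark, &rb->wakeup);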

Sounds good.

> > > +
> > > + Perf
> > > + / | Read samples
> > > + Polling / `--------------| Ring buffer
> > > + v v ;-------------------v
> > > + +----------------+ +---------+---------+ +-------------------+
> > > + |Event wait queue| |data_head|data_tail| |***|***| | |***|
> > > + +----------------+ +---------+---------+ +-------------------+
> > > + ^ ^ `----------------------^
> > > + | Wake up tasks | Store samples
> > > + +-----------------------------+
> > > + | Kernel event core layer |
> > > + +-----------------------------+
> > > +
> > > + * : the data is filled by the writer.
> > > + Figure 6: Writing and reading the ring buffer
> > > +
> > > +Because multiple events might share the same ring buffer for recording
> > > +samples, when any event sample is stored into the ring buffer, the
> > > +kernel event core layer iterates every event associated to the ring
> > > +buffer and wake up tasks on the wait queue of the events. This is
> > > +fulfilled by the kernel function ring_buffer_wakeup().
> >
> > I'm not sure about the use of "event" here. If you do:
> > $ perf stat -e "cycles,instructions"
> > Then cycles and instructions will each have a ring buffer, I'm not
> > sure it is possible to get them to share a ring buffer. I think here
> > you may be referring to side band events, like mmap2.
>
> Seems to me this is incorrect. Since 'perf stat' only reads counters
> (via the kernel function perf_read()), it doesn't allocate a ring
> buffer for events at all. Using GDB, I confirmed the function
> perf_mmap__mmap() is never called for the 'perf stat' command.
>
> Just to clarify, 'perf stat' still mmaps the 'user page' for the
> control structure and for timer counter access, but this is not the
> same thing as the ring buffer.

Sorry, I meant record not stat.

> [...]
>
> > > +In Linux kernel, the event core layer uses the structure perf_buffer to
> > > +track the buffer’s latest header, and it keeps the information for
> > > +buffer pages. The structure perf_buffer manages ring buffer during its
> > > +life cycle, it is allocated once the ring buffer is created and released
> > > +when the ring buffer is destroyed.
> > > +
> > > +It’s possible for multiple events to write buffer concurrently. For
> > > +instance, a software event and a hardware PMU event both are enabled for
> > > +profiling, when the software event is in the middle of sampling, the
> > > +hardware event maybe be overflow and its interrupt is triggered in this
> > > +case. This leads to the race condition for updating perf_buffer::head.
> > > +In __perf_output_begin(), Linux kernel uses compare-and-swap atomicity
> > > +local_cmpxchg() to implement the lockless algorithm for protecting
> > > +perf_buffer::head.
> >
> > In the case of:
> > $ perf record -e "instructions,faults"
> > faults is a software event but it won't share the ring buffer with
> > instructions. I think the atomicity exists for the side band events.
>
> I verified on my Arm64 board and confirmed that hardware events
> (instructions, cycles) share the same ring buffer with a software
> event (faults). Below is the captured log (truncated for readability):
>
> ...
> => __perf_sw_event
> => do_page_fault
> => do_translation_fault
> ...
> ls-884 [003] ...2. 253.737217: perf_output_begin_forward: __perf_output_begin: rb=00000000720bac97 page=0 offset=1688 addr=0000000019392296 size=2408
>
> ...
> => __perf_event_overflow
> => perf_event_overflow
> => armv8pmu_handle_irq
> ...
> ls-884 [003] d.h1. 253.737247: perf_output_begin_forward: __perf_output_begin: rb=00000000720bac97 page=0 offset=1736 addr=000000009e259b5a size=2360
>
> We can see that the software event and the Armv8 PMU event both use
> the same rb pointer (0x00000000720bac97), which means the software
> and hardware events store samples into the same ring buffer. Thus the
> above description is valid.
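>
> For reference, the lockless head update in __perf_output_begin() is
> roughly the following (paraphrased; the space check is elided):
>
>     do {
>             tail = READ_ONCE(rb->user_page->data_tail);
>             offset = head = local_read(&rb->head);
>             head += size;   /* reserve room for this sample */
>     } while (local_cmpxchg(&rb->head, offset, head) != offset);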
>
> P.S. maybe the side band events are not a relevant topic here, but I
> can see that the side band events have a dedicated ring buffer.

You're right, the code doing this uses ioctls, in particular
PERF_EVENT_IOC_SET_OUTPUT:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/lib/perf/evlist.c?h=perf-tools-next#n517
The man page mentions "PERF_FLAG_FD_OUTPUT (broken since Linux
2.6.35)", which had confused me.
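
For anyone following along, redirecting one event's output into
another event's ring buffer is just (a sketch; fd_a and fd_b are the
two events' perf_event_open() file descriptors):

    /* fd_b's samples now go into fd_a's ring buffer; only fd_a needs
     * to be mmap'ed. */
    ioctl(fd_b, PERF_EVENT_IOC_SET_OUTPUT, fd_a);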

Thanks,
Ian

> Thanks,
> Leo