Re: [RFC PATCH v1 1/4] docs: perf: Add documentation for HiSilicon PMCU

From: Jie Zhan
Date: Fri Mar 24 2023 - 22:48:39 EST




On 24/03/2023 20:14, Jonathan Cameron wrote:
On Fri, 24 Mar 2023 17:32:15 +0800
Jie Zhan <zhanjie9@xxxxxxxxxxxxx> wrote:

On 17/03/2023 21:37, Jonathan Cameron wrote:
On Mon, 6 Feb 2023 14:51:43 +0800
Jie Zhan <zhanjie9@xxxxxxxxxxxxx> wrote:
Document the overview and usage of HiSilicon PMCU.

HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
PMU accesses from CPUs, handling the configuration, event switching, and
counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
and multi-PMU-event CPU profiling, in which scenario the current 'perf'
scheme may lose events or drop sampling frequency. With PMCU, users can
reliably obtain the data of up to 240 PMU events with the sample interval
of events down to 1ms, while the software overhead of accessing PMUs, as
well as its impact on target workloads, is reduced.

Signed-off-by: Jie Zhan <zhanjie9@xxxxxxxxxxxxx>
Nice documentation. I've read this a few times before, but on this read
through wondered if we could say anything about the skew between capture
of the counters. Not that important though so I'm happy to add

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>

though this may of course need updating significantly as the interface
is refined (the RFC question you raised for example in the cover letter).

Thanks

Jonathan
---
Documentation/admin-guide/perf/hisi-pmcu.rst | 183 +++++++++++++++++++
Documentation/admin-guide/perf/index.rst | 1 +
2 files changed, 184 insertions(+)
create mode 100644 Documentation/admin-guide/perf/hisi-pmcu.rst

diff --git a/Documentation/admin-guide/perf/hisi-pmcu.rst b/Documentation/admin-guide/perf/hisi-pmcu.rst
new file mode 100644
index 000000000000..50d17cbd0049
--- /dev/null
+++ b/Documentation/admin-guide/perf/hisi-pmcu.rst
@@ -0,0 +1,183 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================================
+HiSilicon Performance Monitor Control Unit
+==========================================
+
+Introduction
+============
+
+HiSilicon Performance Monitor Control Unit (PMCU) is a device that offloads
+PMU accesses from CPUs, handling the configuration, event switching, and
+counter reading of core PMUs on Kunpeng SoC. It facilitates fine-grained
+and multi-PMU-event CPU profiling, in which scenario the current ``perf``
+scheme may lose events or drop sampling frequency. With PMCU, users can
+reliably obtain the data of up to 240 PMU events with the sample interval
+of events down to 1ms, while the software overhead of accessing PMUs, as
+well as its impact on target workloads, is reduced.
+
+Each CPU die is equipped with a PMCU device. The PMCU driver registers it as a
+PMU device, named as ``hisi_pmcu_sccl<N>``, where ``<N>`` is the corresponding
+CPU die ID. When triggered, PMCU reads event IDs and pass them to PMUs in all
+CPUs on the CPU die it is on. PMCU then starts the counters for counting
+events, waits for a time interval, and stops them. The PMU counter readings are
+dumped from hardware to memory, i.e. perf AUX buffers, and further copied to
+the ``perf.data`` file in the user space. PMCU automatically switches events
+(when there are more events than available PMU counters) and completes multiple
+rounds of PMU event counting in one trigger.
+
+Hardware overview
+=================
+
+On Kunpeng SoC, each CPU die is equipped with a PMCU device. PMCU acts like an
+assistant to access the core PMUs on its die and move the counter readings to
+memory. An overview of PMCU's hardware organization is shown below::
+
+ +--------------------+
+ | Memory |
+ | +------+ +-------+ |
+ +--------+ | |Events| |Samples| |
+ | PMCU | | +------+ +-------+ |
+ +---|----+ +---------|----------+
+ | |
+ ======================================================= Bus
+ | | |
+ +----------|----------+ +----------|----------+ |
+ | +------+ | +------+ | | +------+ | +------+ | |
+ | |Core 0| | |Core 1| | | |Core 0| | |Core 1| | |
+ | +--|---+ | +--|---+ | | +--|---+ | +--|---+ | (More
+ | +-----+----+ | | +-----+----+ | clusters
+ | +--|---+ +--|---+ | | +--|---+ +--|---+ | ...)
+ | |Core 2| |Core 3| | | |Core 2| |Core 3| |
+ | +------+ +------+ | | +------+ +------+ |
+ | CPU Cluster 0 | | CPU Cluster 1 |
+ +---------------------+ +---------------------+
+
+On Kunpeng SoC, a CPU die is formed of several CPU clusters and several
+CPUs per cluster. PMCU is able to access the core PMUs in these CPUs.
+The main job of PMCU is to fetch PMU event IDs from memory, make PMUs count the
+events for a while, and move the counter readings back to memory.
+
+Once triggered, PMCU performs a number of loops and processes a number of
+events in each loop. It fetches ``nr_pmu`` events from memory at a time, where
+``nr_pmu`` denotes the number of PMU counters to be used in each CPU. The
+``nr_pmu`` events are passed to the PMU counters of all CPUs on the CPU die
+where PMCU resides. Then, PMCU starts all the counters, waits for a period,
+stops all the counters, and moves the counter readings to memory, before
+handling the next ``nr_pmu`` events if there are more events to process in this
+loop. The number of loops and ``nr_pmu`` are determined by the driver, whereas
+the number of events to process depends on user inputs. The counters are
+stopped when PMCU reads counters and switches events, so there is a tiny time
+window during which the events are not counted.
I'm not clear from this description whether there is 'skew' between the counters
(beyond the normal issues from uarch). Does the PMCU stop all counters
then read them all (minimizing skew) or does it stop each CPUs set of counters
and read those, or stop each individual counter before reading?

My impression is that this feature is meant to be left running over timescales
much longer than the sampling period so it may not be necessary to align the
different lines on the resulting graphs perfectly. Hence maybe this doesn't matter.
Thanks for pointing this out.

The PMCU stops all the counters before reading any counters (i.e. the
first case you said).

The basic procedure is:
    start counters -> wait -> stop counters -> read and reset counters
-> switch events -> start counters -> ...
where each step applys to all CPUs and counters.
Great. So this is across all cores on a die so skew should be minimized
(at a cost of missing more events than a skew heavy approach).

The counters don't count during the tiny stop-start window.
I guess a small improvement would be: reset -> read -> switch -> reset
-> ..., while the counters keep running,
but we still lose some event counts between read and reset, and thus, no
fundamental differrence.
Lots of ways to reduce both skew and missed counts, but I think you are
right in that none of them matter for the intended long term monitoring
usecase.

Jonathan
Yeah it focuses more on general workload characteristics than time-senstive and
precise program analysis.

Jie