Re: [PATCH v4 00/11] Add per-core RAPL energy counter support for AMD CPUs

From: Ian Rogers
Date: Mon Jul 15 2024 - 11:23:12 EST


On Mon, Jul 15, 2024 at 2:36 AM Dhananjay Ugwekar
<Dhananjay.Ugwekar@xxxxxxx> wrote:
>
> Hello Ian,
>
> On 7/12/2024 3:53 AM, Ian Rogers wrote:
> > On Thu, Jul 11, 2024 at 3:25 AM Dhananjay Ugwekar
> > <Dhananjay.Ugwekar@xxxxxxx> wrote:
> >>
> >> Currently the energy-cores event in the power PMU aggregates energy
> >> consumption data at a package level. On the other hand the core energy
> >> RAPL counter in AMD CPUs has a core scope (which means the energy
> >> consumption is recorded separately for each core). Earlier efforts to add
> >> the core event in the power PMU had failed [1], due to the difference in
> >> the scope of these two events. Hence, there is a need for a new core scope
> >> PMU.
> >>
> >> This patchset adds a new "power_per_core" PMU alongside the existing
> >> "power" PMU, which will be responsible for collecting the new
> >> "energy-per-core" event.
> >
> > Sorry for being naive, is the only reason for adding the new PMU for
> > the sake of the cpumask? Perhaps we can add per event cpumasks like
> > say `/sys/devices/power/events/energy-per-core.cpumask` which contains
> > the CPUs of the different cores in this case. There's supporting
> > hotplugging CPUs as an issue. Adding the tool support for this
> > wouldn't be hard and it may be less messy (although old perf tools on
> > new kernels wouldn't know about these files).
>
> I went over the two approaches and below are my thoughts,
>
> New PMU approach:
> Pros
> * It will work with older perf tools, hence these patches can be backported to an older kernel and the new per-core event will work there as well.
> Cons
> * More code changes in rapl.c
>
> Event specific cpumask approach:
> Pros
> * It might be easier to add diff scope events within the same PMU in future(although currently I'm not able to find such a usecase, apart from the RAPL pkg and core energy counters)
> Cons
> * Both new kernel and perf tool will be required to use the new per-core event.
>
> I feel that while the event-specific cpumask is a viable alternative to the new PMU addition approach, I dont see any clear pros to select that over the current approach. Please let me know if you have any design related concerns to the addition of new PMU or your concern is mostly about the amount of code changes in this approach.

Thanks Dhananjay, and thanks for taking the time for an objective
discussion on the mailing list. I'm very supportive of seeing the work
you are enabling land.

My concern comes from the tool side. If every PMU starts to have
variants for the sake of the cpumask what does this mean for
aggregation in the perf tool? There is another issue around event
grouping, you can't group events across PMUs, but my feeling is that
event grouping needs to be rethought. By default the power_per_core
events are going to be aggregated together by the perf tool, which
then loses their per_core-ness.

I was trying to think of the same problem but in other PMUs. One
thought I had was the difference between hyperthread and core events.
At least on Intel, some events can only count for the whole core not
per hyperthread. The events don't have a cpu_per_core PMU, they just
use the regular cpu one, and so the cpumask is set to all online
hyperthreads. When a per-core event is programmed it will get
programmed on every hyperthread and so counted twice for the core.
This at the least wastes a counter, but it probably also yields twice
the expected count as every event is counted twice then aggregated. So
this is just wrong and the user is expected to be smart and fix it
(checking the x86 events there is a convention to use a ".ALL" or
".ANY" suffix for core wide events iirc). If we had a cpumask for
these events then we could avoid the double setting, free up a counter
and avoid double counting. Were we to fix things the way it is done in
this patch series we'd add another PMU.

My feeling is that in the longer term a per event cpumask looks
cleaner. I think either way you need a new kernel for the new RAPL
events. The problem with an old perf tool and a new kernel, this
doesn't normally happen with distributions as they match the perf tool
to the kernel version needlessly losing features and fixes along the
way. If the new PMU is going to get backported through fixes.. then we
can do similar for reading the per event cpumask. I'd be tempted not
to do this and focus on the next LTS kernel, getting the kernel and
tool fixes in as necessary.

Thanks,
Ian


> Thanks,
> Dhananjay
>
> >
> > Thanks,
> > Ian
> >
> >
> >> Tested the package level and core level PMU counters with workloads
> >> pinned to different CPUs.
> >>
> >> Results with workload pinned to CPU 1 in Core 1 on an AMD Zen4 Genoa
> >> machine:
> >>
> >> $ perf stat -a --per-core -e power_per_core/energy-per-core/ -- sleep 1
> >>
> >> Performance counter stats for 'system wide':
> >>
> >> S0-D0-C0 1 0.02 Joules power_per_core/energy-per-core/
> >> S0-D0-C1 1 5.72 Joules power_per_core/energy-per-core/
> >> S0-D0-C2 1 0.02 Joules power_per_core/energy-per-core/
> >> S0-D0-C3 1 0.02 Joules power_per_core/energy-per-core/
> >> S0-D0-C4 1 0.02 Joules power_per_core/energy-per-core/
> >> S0-D0-C5 1 0.02 Joules power_per_core/energy-per-core/
> >> S0-D0-C6 1 0.02 Joules power_per_core/energy-per-core/
> >> S0-D0-C7 1 0.02 Joules power_per_core/energy-per-core/
> >> S0-D0-C8 1 0.02 Joules power_per_core/energy-per-core/
> >> S0-D0-C9 1 0.02 Joules power_per_core/energy-per-core/
> >> S0-D0-C10 1 0.02 Joules power_per_core/energy-per-core/
> >>
> >> [1]: https://lore.kernel.org/lkml/3e766f0e-37d4-0f82-3868-31b14228868d@xxxxxxxxxxxxxxx/
> >>
> >> This patchset applies cleanly on top of v6.10-rc7 as well as latest
> >> tip/master.
> >>
> >> v4 changes:
> >> * Add patch 11 which removes the unused function cpu_to_rapl_pmu()
> >> * Add Rui's rb tag for patch 1
> >> * Invert the pmu scope check logic in patch 2 (Peter)
> >> * Add comments explaining the scope check in patch 2 (Peter)
> >> * Use cpumask_var_t instead of cpumask_t in patch 5 (Peter)
> >> * Move renaming code to patch 8 (Rui)
> >> * Reorder the cleanup order of per-core and per-pkg PMU in patch 10 (Rui)
> >> * Add rapl_core_hw_unit variable to store the per-core PMU unit in patch
> >> 10 (Rui)
> >>
> >> PS: Scope check logic is still kept the same (i.e., all Intel systems being
> >> considered as die scope), Rui will be modifying it to limit the die-scope
> >> only to Cascadelake-AP in a future patch on top of this patchset.
> >>
> >> v3 changes:
> >> * Patch 1 added to introduce the logical_core_id which is unique across
> >> the system (Prateek)
> >> * Use the unique topology_logical_core_id() instead of
> >> topology_core_id() (which is only unique within a package on tested
> >> AMD and Intel systems) in Patch 10
> >>
> >> v2 changes:
> >> * Patches 6,7,8 added to split some changes out of the last patch
> >> * Use container_of to get the rapl_pmus from event variable (Rui)
> >> * Set PERF_EV_CAP_READ_ACTIVE_PKG flag only for pkg scope PMU (Rui)
> >> * Use event id 0x1 for energy-per-core event (Rui)
> >> * Use PERF_RAPL_PER_CORE bit instead of adding a new flag to check for
> >> per-core counter hw support (Rui)
> >>
> >> Dhananjay Ugwekar (10):
> >> perf/x86/rapl: Fix the energy-pkg event for AMD CPUs
> >> perf/x86/rapl: Rename rapl_pmu variables
> >> perf/x86/rapl: Make rapl_model struct global
> >> perf/x86/rapl: Move cpumask variable to rapl_pmus struct
> >> perf/x86/rapl: Add wrapper for online/offline functions
> >> perf/x86/rapl: Add an argument to the cleanup and init functions
> >> perf/x86/rapl: Modify the generic variable names to *_pkg*
> >> perf/x86/rapl: Remove the global variable rapl_msrs
> >> perf/x86/rapl: Add per-core energy counter support for AMD CPUs
> >> perf/x86/rapl: Remove the unused function cpu_to_rapl_pmu
> >>
> >> K Prateek Nayak (1):
> >> x86/topology: Introduce topology_logical_core_id()
> >>
> >> Documentation/arch/x86/topology.rst | 4 +
> >> arch/x86/events/rapl.c | 454 ++++++++++++++++++--------
> >> arch/x86/include/asm/processor.h | 1 +
> >> arch/x86/include/asm/topology.h | 1 +
> >> arch/x86/kernel/cpu/debugfs.c | 1 +
> >> arch/x86/kernel/cpu/topology_common.c | 1 +
> >> 6 files changed, 328 insertions(+), 134 deletions(-)
> >>
> >> --
> >> 2.34.1
> >>