Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in sysfs

From: Liang, Kan
Date: Fri Jul 26 2024 - 10:07:43 EST




On 2024-07-26 4:17 a.m., Dhananjay Ugwekar wrote:
>
>
> On 7/26/2024 1:22 PM, Ian Rogers wrote:
>> On Fri, Jul 26, 2024 at 12:10 AM Dhananjay Ugwekar
>> <Dhananjay.Ugwekar@xxxxxxx> wrote:
>>>
>>>
>>>
>>> On 7/26/2024 12:36 PM, Dhananjay Ugwekar wrote:
>>>> Hello, Ian, Kan,
>>>>
>>>> On 7/20/2024 3:32 AM, Ian Rogers wrote:
>>>>> On Fri, Jul 19, 2024 at 9:35 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>> On 2024-07-19 10:59 a.m., Ian Rogers wrote:
>>>>>>> Thanks Kan. I'm not wondering about a case of 2 CPUs, say on CPU0 and
>>>>>>> solely its perf event context, I want to know its core power and
>>>>>>> package power as a group so I never record one without the other. That
>>>>>>> grouping wouldn't be possible with 2 PMUs.
>>>>>>
>>>>>> For power, to be honest, I don't think it improves anything. It gives
>>>>>> users a false image that perf can group these counters.
>>>>>> But the truth is that perf cannot. The power counters are all
>>>>>> free-running counters. It's impossible to co-schedule them (which
>>>>>> requires a global mechanism to disable/enable all counters, e.g.,
>>>>>> GLOBAL_CTRL for core PMU). The kernel still has to read the counters one
>>>>>> by one while the counters keep running. There are no differences with or
>>>>>> without a group for the power events.
>>>>>
>>>>> Ok, so power should copy cstate with _core, _pkg, etc. I agree the
>>>>> difference is small and I like the idea of being consistent.
>>>>
>>>> So, it seems we want to follow the new PMU addition approach for RAPL
>>>> being consistent with Intel cstate driver, should I revive my "power_per_core"
>>>> PMU thread now?
>>>
>>> The power_per_core PMU thread link for reference,
>>>
>>> https://lore.kernel.org/all/20240711102436.4432-1-Dhananjay.Ugwekar@xxxxxxx/
>>
>> I think so. Would it be possible to follow the same naming convention
>> as cstate, where there is cstate_pkg and cstate_core? (ie no "_per" in
>> the name)
>
> Makes sense, we should probably rename the original "power" PMU to "power_pkg"
> as well.

It may brings some compatible issue for the old platforms. There may be
two ways to address it.
- Add a symlink or something to link the "power" and "power_pkg".
- Only when there are two or more different scopes of counters in a
system, the "power_<scope>" are used. If there is only one scope of
power counter, "power" is still used.
The latter method is used for the Intel uncore and hybrid core drivers now.

Thanks,
Kan

>
> Thanks,
> Dhananjay
>
>>
>> Thanks,
>> Ian
>>
>>>>
>>>> Thanks,
>>>> Dhananjay
>>>>
>>>> Do we
>>>>> want to add "event.cpus" support to the tool anyway for potential
>>>>> future uses? This would at least avoid problems with newer kernels and
>>>>> older perf tools were we to find a good use for it.
>>>>>
>>>>>>> My understanding had been that for core PMUs a "perf stat -C" option
>>>>>>> would choose the particular CPU to count the event on, for an uncore
>>>>>>> PMU the -C option would override the cpumask's "default" value. We
>>>>>>> have code to validate this:
>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evlist.c?h=perf-tools-next#n2522
>>>>>>> But it seems now that overriding an uncore PMU's default CPU is
>>>>>>> ignored.
>>>>>>
>>>>>> For the uncore driver, no matter what -C set, it writes the default CPU
>>>>>> back.
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/arch/x86/events/intel/uncore.c#n760
>>>>>>
>>>>>>> If you did:
>>>>>>> $ perf stat -C 1 -e data_read -a sleep 0.1
>>>>>>> Then the tool thinks data_read is on CPU1 and will set its thread
>>>>>>> affinity to CPU1 to avoid IPIs. It seems to fix this we need to just
>>>>>>> throw away the -C option.
>>>>>> The perf tool can still read the the counter from CPU1 and no IPIs
>>>>>> because of the PMU_EV_CAP_READ_ACTIVE_PKG().
>>>>>>
>>>>>> Not quite sure, but it seems only the open and close may be impacted and
>>>>>> silently changed to CPU0.
>>>>>
>>>>> There's also enable/disable. Andi did the work and there were some
>>>>> notable gains but likely more on core events. Ultimately it'd be nice
>>>>> to be opening, closing.. everything in parallel given the calls are
>>>>> slow and the work is embarrassingly parallel.
>>>>> It feels like the cpumasks for uncore could still do with some cleanup
>>>>> wrt -C I'm just unsure at the moment what this should be. Tbh, I'm
>>>>> tempted to rewrite evlist propagate maps as someone may look at it and
>>>>> think I believe in what it is doing. The parallel stuff we should grab
>>>>> Riccardo's past work.
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Kan
>>>>>>>
>>>>>>>>> 2) do the /sys/devices/<pmu>/events/event.(unit|scale|per-pkg|snapshot)
>>>>>>>>> files parse correctly and have a corresponding event.
>>>>>>>>> 3) keep adding opening events on the PMU to a group to make sure that
>>>>>>>>> when counters are exhausted the perf_event_open fails (I've seen this
>>>>>>>>> bug on AMD)
>>>>>>>>> 4) are the values in the type file unique
>>>>>>>>>
>>>>>>>>
>>>>>>>> The rest sounds good to me.
>>>>>>>
>>>>>>> Cool. Let me know if you can think of more.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ian
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kan
>>>>>>>
>>>
>