Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in sysfs

From: Dhananjay Ugwekar
Date: Fri Jul 26 2024 - 04:17:59 EST




On 7/26/2024 1:22 PM, Ian Rogers wrote:
> On Fri, Jul 26, 2024 at 12:10 AM Dhananjay Ugwekar
> <Dhananjay.Ugwekar@xxxxxxx> wrote:
>>
>>
>>
>> On 7/26/2024 12:36 PM, Dhananjay Ugwekar wrote:
>>> Hello, Ian, Kan,
>>>
>>> On 7/20/2024 3:32 AM, Ian Rogers wrote:
>>>> On Fri, Jul 19, 2024 at 9:35 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>> On 2024-07-19 10:59 a.m., Ian Rogers wrote:
>>>>>> Thanks Kan. I'm not wondering about a case of 2 CPUs, say on CPU0 and
>>>>>> solely its perf event context, I want to know its core power and
>>>>>> package power as a group so I never record one without the other. That
>>>>>> grouping wouldn't be possible with 2 PMUs.
>>>>>
>>>>> For power, to be honest, I don't think it improves anything. It gives
>>>>> users a false image that perf can group these counters.
>>>>> But the truth is that perf cannot. The power counters are all
>>>>> free-running counters. It's impossible to co-schedule them (which
>>>>> requires a global mechanism to disable/enable all counters, e.g.,
>>>>> GLOBAL_CTRL for core PMU). The kernel still has to read the counters one
>>>>> by one while the counters keep running. There are no differences with or
>>>>> without a group for the power events.
>>>>
>>>> Ok, so power should copy cstate with _core, _pkg, etc. I agree the
>>>> difference is small and I like the idea of being consistent.
>>>
>>> So, it seems we want to follow the new PMU addition approach for RAPL
>>> being consistent with Intel cstate driver, should I revive my "power_per_core"
>>> PMU thread now?
>>
>> The power_per_core PMU thread link for reference,
>>
>> https://lore.kernel.org/all/20240711102436.4432-1-Dhananjay.Ugwekar@xxxxxxx/
>
> I think so. Would it be possible to follow the same naming convention
> as cstate, where there is cstate_pkg and cstate_core? (ie no "_per" in
> the name)

Makes sense, we should probably rename the original "power" PMU to "power_pkg"
as well.

Thanks,
Dhananjay

>
> Thanks,
> Ian
>
>>>
>>> Thanks,
>>> Dhananjay
>>>
>>> Do we
>>>> want to add "event.cpus" support to the tool anyway for potential
>>>> future uses? This would at least avoid problems with newer kernels and
>>>> older perf tools were we to find a good use for it.
>>>>
>>>>>> My understanding had been that for core PMUs a "perf stat -C" option
>>>>>> would choose the particular CPU to count the event on, for an uncore
>>>>>> PMU the -C option would override the cpumask's "default" value. We
>>>>>> have code to validate this:
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evlist.c?h=perf-tools-next#n2522
>>>>>> But it seems now that overriding an uncore PMU's default CPU is
>>>>>> ignored.
>>>>>
>>>>> For the uncore driver, no matter what -C set, it writes the default CPU
>>>>> back.
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/arch/x86/events/intel/uncore.c#n760
>>>>>
>>>>>> If you did:
>>>>>> $ perf stat -C 1 -e data_read -a sleep 0.1
>>>>>> Then the tool thinks data_read is on CPU1 and will set its thread
>>>>>> affinity to CPU1 to avoid IPIs. It seems to fix this we need to just
>>>>>> throw away the -C option.
>>>>> The perf tool can still read the the counter from CPU1 and no IPIs
>>>>> because of the PMU_EV_CAP_READ_ACTIVE_PKG().
>>>>>
>>>>> Not quite sure, but it seems only the open and close may be impacted and
>>>>> silently changed to CPU0.
>>>>
>>>> There's also enable/disable. Andi did the work and there were some
>>>> notable gains but likely more on core events. Ultimately it'd be nice
>>>> to be opening, closing.. everything in parallel given the calls are
>>>> slow and the work is embarrassingly parallel.
>>>> It feels like the cpumasks for uncore could still do with some cleanup
>>>> wrt -C I'm just unsure at the moment what this should be. Tbh, I'm
>>>> tempted to rewrite evlist propagate maps as someone may look at it and
>>>> think I believe in what it is doing. The parallel stuff we should grab
>>>> Riccardo's past work.
>>>>
>>>> Thanks,
>>>> Ian
>>>>
>>>>
>>>>> Thanks,
>>>>> Kan
>>>>>>
>>>>>>>> 2) do the /sys/devices/<pmu>/events/event.(unit|scale|per-pkg|snapshot)
>>>>>>>> files parse correctly and have a corresponding event.
>>>>>>>> 3) keep adding opening events on the PMU to a group to make sure that
>>>>>>>> when counters are exhausted the perf_event_open fails (I've seen this
>>>>>>>> bug on AMD)
>>>>>>>> 4) are the values in the type file unique
>>>>>>>>
>>>>>>>
>>>>>>> The rest sounds good to me.
>>>>>>
>>>>>> Cool. Let me know if you can think of more.
>>>>>>
>>>>>> Thanks,
>>>>>> Ian
>>>>>>
>>>>>>> Thanks,
>>>>>>> Kan
>>>>>>
>>