Re: [PATCH v4 3/4] perf/core: Remove pmu linear searching code

From: Ian Rogers
Date: Tue May 30 2023 - 10:00:40 EST


On Tue, May 30, 2023 at 12:45 AM Thomas Richter <tmricht@xxxxxxxxxxxxx> wrote:
>
> On 5/27/23 20:38, Marc Zyngier wrote:
> > On Sat, 27 May 2023 18:00:13 +0100,
> > Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >>
> >> On Sat, May 27, 2023 at 6:32 AM Marc Zyngier <maz@xxxxxxxxxx> wrote:
> >>>
> >>> On Sat, 27 May 2023 00:00:47 +0100,
> >>> Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Thu, May 25, 2023 at 8:56 AM Oliver Upton <oliver.upton@xxxxxxxxx> wrote:
> >>>>>
> >>>>> On Thu, May 25, 2023 at 04:20:31PM +0200, Peter Zijlstra wrote:
> >>>>>> On Thu, May 25, 2023 at 07:11:41AM +0000, Oliver Upton wrote:
> >>>>>>
> >>>>>>> The PMUv3 driver does pass a name, but it relies on getting back an
> >>>>>>> allocated pmu id as @type is -1 in the call to perf_pmu_register().
> >>>>>>>
> >>>>>>> What actually broke is how KVM probes for a default core PMU to use for
> >>>>>>> a guest. kvm_pmu_probe_armpmu() creates a counter w/ PERF_TYPE_RAW and
> >>>>>>> reads the pmu from the returned perf_event. The linear search had the
> >>>>>>> effect of eventually stumbling on the correct core PMU and succeeding.
> >>>>>>>
> >>>>>>> Perf folks: is this WAI for heterogenous systems?
> >>>>>>
> >>>>>> TBH, I'm not sure. hetero and virt don't mix very well AFAIK and I'm not
> >>>>>> sure what ARM64 does here.
> >>>>>>
> >>>>>> IIRC the only way is to hard affine things; that is, force vCPU of
> >>>>>> 'type' to the pCPU mask of 'type' CPUs.
> >>>>>
> >>>>> We provide absolutely no illusion of consistency across implementations.
> >>>>> Userspace can select the PMU type, and then it is a userspace problem
> >>>>> affining vCPUs to the right pCPUs.
> >>>>>
> >>>>> And if they get that wrong, we just bail and refuse to run the vCPU.
> >>>>>
> >>>>>> If you don't do that; or let userspace 'override' that, things go
> >>>>>> sideways *real* fast.
> >>>>>
> >>>>> Oh yeah, and I wish PMUs were the only problem with these hetero
> >>>>> systems...
> >>>>
> >>>> Just to add some context from what I understand. There are inbuilt
> >>>> type numbers for PMUs:
> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/include/uapi/linux/perf_event.h?h=perf-tools-next#n34
> >>>> so the PMU generally called /sys/devices/cpu should have type 4 (ARM
> >>>> give it another name). For heterogeneous ARM there is a single PMU and
> >>>> the same events are programmed regardless of whether it is a big or a
> >>>> little core - the cpumask lists all CPUs.
> >>>
> >>> I think you misunderstood the way heterogeneous arm64 systems are
> >>> described . Each CPU type gets its own PMU type, and its own event
> >>> list. Case in point:
> >>>
> >>> $ grep . /sys/devices/*pmu/{type,cpus}
> >>> /sys/devices/apple_avalanche_pmu/type:9
> >>> /sys/devices/apple_blizzard_pmu/type:8
> >>> /sys/devices/apple_avalanche_pmu/cpus:4-9
> >>> /sys/devices/apple_blizzard_pmu/cpus:0-3
> >>>
> >>> Type 4 (aka PERF_EVENT_RAW) is AFAICT just a way to encode the raw
> >>> event number, nothing else.
> >>
> >> Which PMU will a raw event open on?
> >
> > On the PMU that matches the current CPU.
> >
> >> Note, the raw events don't support
> >> the extended type that is present in PERF_TYPE_HARDWARE and
> >> PERF_TYPE_HW_CACHE:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/include/uapi/linux/perf_event.h#n41
> >> as the bits are already in use for being just plain config values.
> >
> > I'm not sure how relevant this is to the numbering of PMUs on arm64.
> >
> >> I suspect not being type 4 is a bug on apple ARM here.
> >
> > If that's a bug on this machine, it's a bug on all machines, at which
> > point it is the de-facto API:
> >
> > $ grep . /sys/devices/armv8*/{type,cpus}
> > /sys/devices/armv8_cortex_a53/type:8
> > /sys/devices/armv8_cortex_a72/type:9
> > /sys/devices/armv8_cortex_a53/cpus:0-3
> > /sys/devices/armv8_cortex_a72/cpus:4-5
> >
> > See, non-Apple HW. And now for a system with homogeneous CPUs:
> >
> > $ grep . /sys/devices/armv8*/{type,cpus}
> > /sys/devices/armv8_pmuv3_0/type:8
> > /sys/devices/armv8_pmuv3_0/cpus:0-159
> >
> > Still no type 4. I could go on for hours, I have plenty of HW around
> > me!
> >
> > So whatever your source of information is, it doesn't match reality.
> > Our PMUs are numbered arbitrarily, and have been so for... a very long
> > time. At least since perf_pmu_register has supported dynamic
> > registration (see 2e80a82a49c4c).
> >
> > Thanks,
> >
> > M.
> >
>
>
> I agree with Marc,
> on s390 we have 5 different PMUs and all have arbitrary numbers
> and have totally different features:
>
> # ll /sys/devices/{cpum,pai}*/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf_diag/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_cf/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/cpum_sf/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_crypto/type
> -r--r--r-- 1 root root 4096 May 30 09:33 /sys/devices/pai_ext/type
> # grep . /sys/devices/{cpum,pai}*/type
> /sys/devices/cpum_cf_diag/type:9
> /sys/devices/cpum_cf/type:8
> /sys/devices/cpum_sf/type:4
> /sys/devices/pai_crypto/type:10
> /sys/devices/pai_ext/type:11
> #

Thanks Thomas, could you:
$ ls /sys/devices/*/cpu*
The assumption is that if a file called "cpus" exists then this a
"core" PMU, while "cpumask" would indicate an uncore PMU. The
exception is /sys/devices/cpu where there is no cpus but all online
CPUs are assumed to be in "cpus" cpu map. On Intel one of the core
PMUs has type 4 which aligns with the perf_event.h "ABI" constant
PERF_TYPE_RAW, so opening a type 4 event will open it on that PMU. The
question is I have is what should be the behavior be on non-Intel for
type 4, for the big.little/hybrid case it seems opening it on core
PMUs would make most sense and is what is done for PERF_TYPE_HARDWARE
and PERF_TYPE_HW_CACHE.

Thanks,
Ian

> Thanks Thomas
> --
> Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
> --
> Vorsitzender des Aufsichtsrats: Gregor Pillen
> Geschäftsführung: David Faller
> Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
>