Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension

From: Kim Phillips
Date: Wed Jun 28 2017 - 21:00:09 EST

Next message: Adam Borowski: "Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit"
Previous message: William Koh: "Re: [PATCH] fs: ext4: inode->i_generation not assigned 0."
In reply to: Mark Rutland: "Re: [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support"
Next in thread: Mark Rutland: "Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 28 Jun 2017 12:26:02 +0100
Mark Rutland <mark.rutland@xxxxxxx> wrote:

> On Tue, Jun 27, 2017 at 04:07:58PM -0500, Kim Phillips wrote:
> > I'm close to finishing the bts version of userspace, and have been
> > testing a bit more thoroughly, so now I consistently see the excessive
> > PADding when recording a CPU that's idle. I.e., when I taskset the perf
> > record to the same CPU I specify to record's -C (taskset -c n perf
> > record -C n), I get max. twenty-odd number of PAD bytes at the end of
> > the AUX buffers in the perf.data file. If, OTOH, I taskset -c n perf
> > record -C m, where m != n, I get a couple of valid event records in the
> > buffer, and the rest of the buffer is filled with PADding.
> >
> > It wouldn't be a problem except that it's wastes too much space
> > sometimes. Here is a good output buffer sample from a --mmap-pages=,12
> > run, with only 4 PADs tacked onto the end:
> >
> > 0xd190 [0x30]: PERF_RECORD_AUXTRACE size: 0x48 offset: 0 ref: 0xe914f7e3ce idx: 0 tid: -1 cpu: 2
> > .
> > . ... ARM SPE data: size 72 bytes
> > . 00000000: 4a 01 B COND
>
> [...]
>
> > . 0000003b: 71 a5 39 e1 14 e9 00 00 00 TS 1001077684645
> > . 00000044: 00 PAD
> > . 00000045: 00 PAD
> > . 00000046: 00 PAD
> > . 00000047: 00 PAD
> >
> > whereas this one - from later on in the same run - is over 99% PADs:
> >
> > 0xd250 [0x30]: PERF_RECORD_AUXTRACE size: 0x5fc0 offset: 0xfffff4ae0044 ref: 0xe91cead1dd idx: 0 tid: -1 cpu: 2
> > .
> > . ... ARM SPE data: size 24512 bytes
> > . 00000000: 4a 00 B
>
> [...]
>
> > . 000000b0: 71 8f 4e e1 14 e9 00 00 00 TS 1001077689999
> > . 000000b9: 00 PAD
> > ...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...ALL PADs...
> > . 00005fbf: 00 PAD
>
> Interesting.
>
> If you cat /proc/interrupts, do you see many more SPE interrupts on CPU
> n than on m?

When n == m, I see approx. 1 IRQ per SPE buffer full.

When n != m, I see neither CPU n or m incur SPE interrupts; the
workload ran but didn't get recorded, or, rather, 'idleness' got
recorded instead.

> Otherwise, I wonder if this is some odd interaction with idle. Can you
> try to forcefully load that other CPU?
>
> e.g. run something like:
>
> taskset -c <n> sh -c 'while true; do done'
>
> ... in parallel with the tracer.

If I do a:

taskset -c 1 sh -c 'while true; do echo blah > /dev/null' &
taskset -c 0 perf record -C 1 ...

then non-idleness and non-PADdingness get recorded.

> For reference, what was your event sample period (i.e. the value of
> perf_event_attr::sample_period)?
>
> Did you modify that at all with PERF_EVENT_IOC_PERIOD?

If that's the same as 'perf record -c <period>', then, yes, I set
the period to values such as 512, 1024.

> > > > Meanwhile, when using fvp-base.dtb, my model setup stops booting the
> > > > kernel after "smp: Bringing up secondary CPUs ...". If I however take
> > > > the second SPE node from fvp-base.dts and add it to my working device
> > > > tree, I get this during the driver probe:
> > > >
> > > > [ 1.042063] arm_spe_pmu spe-pmu@0: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > > > [ 1.043582] arm_spe_pmu spe-pmu@1: probed for CPUs 0-7 [max_record_sz 64, align 1, features 0xf]
> > > > [ 1.043631] genirq: Flags mismatch irq 6. 00004404 (arm_spe_pmu) vs. 00004404 (arm_spe_pmu)
> > >
> > > Looks like you've screwed up your IRQ partitions, so you are effectively
> > > registering the same device twice, which then blows up due to lack of shared
> > > irqs.
> > >
> > > Either remove one of the devices, or use IRQ partitions to restrict them
> > > to unique sets of CPUs.
> >
> > Right, but since I want to get parity with what you're running -
> > fvp_base.dtb - I tried to debug the hang after "smp: Bringing up
> > secondary CPUs ..." problem, and could only debug it to the PSCI driver
> > hitting one of these cases:
> >
> > case PSCI_RET_INVALID_PARAMS:
> > case PSCI_RET_INVALID_ADDRESS:
>
> Sounds like your DT is describing CPUs that don't exist (or perhaps the
> same CPU several times). Certainly, PSCI and the kernel disagree on
> which CPUS exist.
>
> What exact DT are you using?

the one this commit to linux-will's perf/spe branch provides:

commit 2a73de57eaf61cdfd61be1e20a46e4a2c326775f
Author: Marc Zyngier <marc.zyngier@xxxxxxx>
Date: Tue Mar 11 18:14:45 2014 +0000

arm64: dts: add model device-tree

Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
Signed-off-by: Will Deacon <will.deacon@xxxxxxx>

> Are you using the bootwrapper, or ATF? I'm guessing you're using the
> bootwrapper.

I'm using the wrapper to wrap arm-trusted-firmware (ATF?) objects, so,
both? I noticed the wrapper I was using was pretty old, so I updated
it.

arm-trusted-firmware, btw, has just been updated to enable SPE at lower
ELs, so I don't have to use a hacked-up version anymore.

I also updated my BL33 to the latest upstream u-boot
vexpress_aemv8a_dram_defconfig, and at least now the kernel continues
to boot, even though it can't bring up 6 of the 7 secondary CPUs.

> Which version of the bootwrapepr are you using? If it doesn't have
> commit:
>
> ccdc936924b3682d ("Dynamically determine the set of CPUs")
>
> ... have you configured it appropriately with --with-cpu-ids?
>
> How is your model configured?

CLUSTER0_NUM_CORES=4
CLUSTER1_NUM_CORES=4

> Which CPU IDs does it think exist?

1,2,3,4,0x100,0x101,0x102,0x103

...which are different from the above device tree!:

0,0x100,0x200,0x300,0x10000,0x10100,0x10200,0x10300

So I imagine that's the problem, thanks!

I don't see how to tell the model to put the CPUs at different
addresses, only a lot of GICv3 redistributor switches? btw, where can
I get updates to the run-model.sh scripts? Answer off-list?

> > Note: it's yet another place I have to manually instrument the error
> > path in a kernel driver in lieu of it being more naturally verbose by
> > itself; I *implore* you to reconsider adding proper user messaging to
> > arm_spe_pmu_event_init().
>
> Given this is a FW configuration issue (i.e. a system-level error), I'm
> more than happy to make the PSCI driver messages more helpful where
> possible.
>
> That's completely orthogonal to the SPE debug messages for requests made
> by the user.

I respectfully disagree, given the current state of the interfaces
involved.

> > I can't tell which part of the fvp-base device tree is not liked by the
> > firmware; I tried different combinations of the PSCI node, different CPU
> > enumerations (cpu@100 vs cpu@1), removing idle-states properties...any
> > hints appreciated.
>
> The bootwrapper doesn't support idle. So no idle-states should be in the
> DT.
>
> If you can share your DT, bootwrapper configuration, and model
> configuration, I can try to debug this with you.

I reverted the wrapper's ccdc936924b3682d ("Dynamically determine the
set of CPUs") commit you mentioned above, and specified the cpu-ids
manually, and am now running with 8 CPUs, although linux enumerates
them as 0,1,8,9,10,11,12,13?

Thanks for your continued support,

Kim

Next message: Adam Borowski: "Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit"
Previous message: William Koh: "Re: [PATCH] fs: ext4: inode->i_generation not assigned 0."
In reply to: Mark Rutland: "Re: [PATCH] perf tools: Add ARM Statistical Profiling Extensions (SPE) support"
Next in thread: Mark Rutland: "Re: [PATCH v4 0/5] Add support for the ARMv8.2 Statistical Profiling Extension"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]