[PATCH v2] perf tools: Add ARM Statistical Profiling Extensions (SPE) support

From: Kim Phillips
Date: Thu Aug 17 2017 - 23:12:01 EST


On Mon, 17 Jul 2017 19:48:22 -0500
Kim Phillips <kim.phillips@xxxxxxx> wrote:

> On Fri, 30 Jun 2017 15:02:41 +0100
> Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> > Hi Kim,
>
> Hi Mark,
>
> > On Wed, Jun 28, 2017 at 08:43:10PM -0500, Kim Phillips wrote:
> <snip>
> > > if (evlist) {
> > > evlist__for_each_entry(evlist, evsel) {
> > > if (cs_etm_pmu &&
> > > evsel->attr.type == cs_etm_pmu->type)
> > > found_etm = true;
> > > + if (arm_spe_pmu &&
> > > + evsel->attr.type == arm_spe_pmu->type)
> > > + found_spe = true;
> >
> > Given ARM_SPE_PMU_NAME is defined as "arm_spe_0", this won't detect all
> > SPE PMUs in heterogeneous setups (e.g. this'll fail to match "arm_spe_1"
> > and so on).
> >
> > Can we not find all PMUs with a "arm_spe_" prefix?
> >
> > ... or, taking a step back, do we need some sysfs "class" attribute to
> > identify multi-instance PMUs?
>
> Since there is one SPE per core, and the buffer-full interrupt line
> looks like the only difference between the SPE device node
> specifications in the device tree, I don't understand why the driver
> doesn't accept a singular "arm_spe" from the tool and manage interrupt
> handling accordingly. Also, if a set of CPUs is missing SPE support
> and the user doesn't explicitly set a CPU affinity outside that
> partition, the driver could decline to run, stating why.
>
> This would also obviously help a lot from an ease-of-use perspective.
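For what it's worth, a prefix match along the lines Mark suggested could look roughly like this (a sketch only; the helper name is made up):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: treat any PMU whose name starts with
 * "arm_spe_" as an SPE instance, so heterogeneous setups with
 * "arm_spe_0", "arm_spe_1", ... are all detected. */
static bool is_arm_spe_pmu_name(const char *name)
{
	return strncmp(name, "arm_spe_", strlen("arm_spe_")) == 0;
}
```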
>
> <snip>
>
> > > +struct arm_spe_recording {
> > > + struct auxtrace_record itr;
> > > + struct perf_pmu *arm_spe_pmu;
> > > + struct perf_evlist *evlist;
> > > +};
> >
> > A user may wish to record trace on separate uarches simultaneously, so
> > having a singleton arm_spe_pmu for the entire evlist doesn't seem right.
> >
> > e.g. I don't see why we shouldn't allow a user to do:
> >
> > ./perf record -c 1024 \
> > -e arm_spe_0/ts_enable=1,pa_enable=0/ \
> > -e arm_spe_1/ts_enable=1,pa_enable=0/ \
> > ${WORKLOAD}
> >
> > ... which perf-record seems to accept today, but I don't seem to get any
> > aux trace, regardless of whether I taskset the entire thing to any
> > particular CPU.
>
> The implementation-defined components of the SPE don't affect a
> 'record'/capture operation, so a single arm_spe should be fine with
> separate uarch setups.
>
> > It also seems that the events aren't task bound, judging by -vv output:
>
> <snip>
>
> > I see something similar (i.e. perf doesn't try to bind the events to the
> > workload PID) when I try to record with only a single PMU. In that case,
> > perf-record blows up because it can't handle events on a subset of CPUs
> > (though it should be able to):
> >
> > nanook@torsk:~$ ./perf record -vv -c 1024 -e arm_spe_0/ts_enable=1,pa_enable=0/ true
>
> <snip>
>
> > mmap size 266240B
> > AUX area mmap length 131072
> > perf event ring buffer mmapped per cpu
> > failed to mmap AUX area
> > failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0xffffc8596038, 512)=22)
>
> FWIW, that INTERNAL ERROR is fixed by this commit:
>
> commit 8a1898db51a3390241cd5fae267dc8aaa9db0f8b
> Author: Hendrik Brueckner <brueckner@xxxxxxxxxxxxxxxxxx>
> Date: Tue Jun 20 12:26:39 2017 +0200
>
> perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)
>
> So now it should return:
>
> failed to mmap with 95 (Operation not supported)
>
> > ... with a SW event, this works as expected, being bound to the workload PID:
>
> <snip>
>
> > ... so I guess this has something to do with the way the tool tries to
> > use the cpumask, making the wrong assumption that this implies
> > system-wide collection is mandatory / expected.
>
> Right, I'll take a look at it.
>
> > > + if (!opts->full_auxtrace)
> > > + return 0;
> > > +
> > > + /* We are in full trace mode but '-m,xyz' wasn't specified */
> > > + if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {
> > > + if (privileged) {
> > > + opts->auxtrace_mmap_pages = MiB(4) / page_size;
> > > + } else {
> > > + opts->auxtrace_mmap_pages = KiB(128) / page_size;
> > > + if (opts->mmap_pages == UINT_MAX)
> > > + opts->mmap_pages = KiB(256) / page_size;
> > > + }
> > > + }
> > > +
> > > + /* Validate auxtrace_mmap_pages */
> > > + if (opts->auxtrace_mmap_pages) {
> > > + size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;
> > > + size_t min_sz = KiB(8);
> > > +
> > > + if (sz < min_sz || !is_power_of_2(sz)) {
> > > + pr_err("Invalid mmap size for ARM SPE: must be at least %zuKiB and a power of 2\n",
> > > + min_sz / 1024);
> > > + return -EINVAL;
> > > + }
> > > + }
> > > +
> > > + /*
> > > + * To obtain the auxtrace buffer file descriptor, the auxtrace event
> > > + * must come first.
> > > + */
> > > + perf_evlist__to_front(evlist, arm_spe_evsel);
> >
> > Huh? *what* needs the auxtrace buffer fd?
> >
> > This seems really fragile. Can't we store this elsewhere?
>
> It's copied from the bts code, and the other auxtrace record users do
> the same; it looks like auxtrace record has implicit dependencies on it?
>
> > > +static u64 arm_spe_reference(struct auxtrace_record *itr __maybe_unused)
> > > +{
> > > + u64 ts;
> > > +
> > > + asm volatile ("isb; mrs %0, cntvct_el0" : "=r" (ts));
> > > +
> > > + return ts;
> > > +}
> >
> > I do not think it's a good idea to read the counter directly like this.
> >
> > What is this "reference" intended to be meaningful relative to?
>
> AFAICT, it's just a nonce the perf tool uses to track unique events,
> and I thought this better than the ETM driver's heavier get-random
> implementation.
>
> > Why do we need to do this in userspace?
> >
> > Can we not ask the kernel to output timestamps instead?
>
> Why? This gets the job done faster.
>
> > > +static int arm_spe_get_events(const unsigned char *buf, size_t len,
> > > + struct arm_spe_pkt *packet)
> > > +{
> > > + unsigned int events_len = payloadlen(buf[0]);
> > > +
> > > + if (len < events_len)
> > > + return ARM_SPE_NEED_MORE_BYTES;
> >
> > Isn't len the size of the whole buffer? So isn't this failing to account
> > for the header byte?
>
> well spotted; I changed the length check from 'events_len' to '1 + events_len'.
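Concretely, the corrected check looks roughly like this (a self-contained sketch; payloadlen() is reproduced here from the decoder, with the size coming from header bits [5:4]):

```c
#include <stddef.h>

#define ARM_SPE_NEED_MORE_BYTES	-1

/* Payload size encoded in header bits [5:4]: 1, 2, 4, or 8 bytes. */
static int payloadlen(unsigned char hdr)
{
	return 1 << ((hdr & 0x30) >> 4);
}

/* 'len' covers the whole remaining buffer, so the check must
 * account for the one-byte header in front of the payload. */
static int arm_spe_events_len_ok(const unsigned char *buf, size_t len)
{
	int events_len = payloadlen(buf[0]);

	if (len < (size_t)(1 + events_len))
		return ARM_SPE_NEED_MORE_BYTES;
	return 1 + events_len;
}
```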
>
> > > + packet->type = ARM_SPE_EVENTS;
> > > + packet->index = events_len;
> >
> > Huh? The events packet has no "index" field, so why do we need this?
>
> To identify Events with fewer comparisons in arm_spe_pkt_desc():
> E.g., the LLC-ACCESS, LLC-REFILL, and REMOTE-ACCESS events are
> identified iff index > 1.
>
> > > + switch (events_len) {
> > > + case 1: packet->payload = *(uint8_t *)(buf + 1); break;
> > > + case 2: packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1)); break;
> > > + case 4: packet->payload = le32_to_cpu(*(uint32_t *)(buf + 1)); break;
> > > + case 8: packet->payload = le64_to_cpu(*(uint64_t *)(buf + 1)); break;
> > > + default: return ARM_SPE_BAD_PACKET;
> > > + }
> > > +
> > > + return 1 + events_len;
> > > +}
> > > +
> > > +static int arm_spe_get_data_source(const unsigned char *buf,
> > > + struct arm_spe_pkt *packet)
> > > +{
> > > + int len = payloadlen(buf[0]);
> > > +
> > > + packet->type = ARM_SPE_DATA_SOURCE;
> > > + if (len == 1)
> > > + packet->payload = buf[1];
> > > + else if (len == 2)
> > > + packet->payload = le16_to_cpu(*(uint16_t *)(buf + 1));
> > > +
> > > + return 1 + len;
> > > +}
> >
> > For those packets with a payload, the header has a uniform format
> > describing the payload size. Given that, can't we make the payload
> > extraction generic, regardless of the packet type?
> >
> > e.g. something like:
> >
> > static int arm_spe_get_payload(const unsigned char *buf, size_t len,
> > struct arm_spe_pkt *packet)
> > {
> > <determine payload size>
> > <length check>
> > <switch>
> > <return nr consumed bytes (inc header), or error>
> > }
> >
> > static int arm_spe_get_events(const unsigned char *buf, size_t len,
> > struct arm_spe_pkt *packet)
> > {
> > packet->type = ARM_SPE_EVENTS;
> > return arm_spe_get_payload(buf, len, packet);
> > }
> >
> > static int arm_spe_get_data_source(const unsigned char *buf,
> > struct arm_spe_pkt *packet)
> > {
> > packet->type = ARM_SPE_DATA_SOURCE;
> > return arm_spe_get_payload(buf, len, packet);
> > }
> >
> > ... and so on for the other packets with a payload.
>
> done for TIMESTAMP, EVENTS, DATA_SOURCE, CONTEXT, INSN_TYPE. It
> wouldn't fit ADDR and COUNTER well, since those can occur with an
> extended header, and their lengths are encoded differently and are
> fixed anyway.
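The shared extraction ends up looking roughly like this (a self-contained sketch with stand-in types; the real code uses the le*_to_cpu helpers rather than assuming a little-endian host as this sketch does):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
	ARM_SPE_NEED_MORE_BYTES	= -1,
	ARM_SPE_BAD_PACKET	= -2,
};

struct arm_spe_pkt {
	int		type;
	uint64_t	payload;
};

/* Payload size from header bits [5:4]: 1, 2, 4, or 8 bytes. */
static size_t payloadlen(unsigned char hdr)
{
	return 1U << ((hdr & 0x30) >> 4);
}

static int arm_spe_get_payload(const unsigned char *buf, size_t len,
			       struct arm_spe_pkt *packet)
{
	size_t payload_len = payloadlen(buf[0]);

	if (len < 1 + payload_len)
		return ARM_SPE_NEED_MORE_BYTES;

	/* payload_len is always 1, 2, 4, or 8; copy it verbatim
	 * (assumes a little-endian host in this sketch). */
	packet->payload = 0;
	memcpy(&packet->payload, buf + 1, payload_len);

	return 1 + payload_len;	/* bytes consumed, incl. header */
}
```

Each per-type getter then just sets packet->type and calls this.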
>
> > > +static int arm_spe_do_get_packet(const unsigned char *buf, size_t len,
> > > + struct arm_spe_pkt *packet)
> > > +{
> > > + unsigned int byte;
> > > +
> > > + memset(packet, 0, sizeof(struct arm_spe_pkt));
> > > +
> > > + if (!len)
> > > + return ARM_SPE_NEED_MORE_BYTES;
> > > +
> > > + byte = buf[0];
> > > + if (byte == 0)
> > > + return arm_spe_get_pad(packet);
> > > + else if (byte == 1) /* no timestamp at end of record */
> > > + return arm_spe_get_end(packet);
> > > + else if (byte & 0xc0 /* 0y11000000 */) {
> > > + if (byte & 0x80) {
> > > + /* 0x38 is 0y00111000 */
> > > + if ((byte & 0x38) == 0x30) /* address packet (short) */
> > > + return arm_spe_get_addr(buf, len, 0, packet);
> > > + if ((byte & 0x38) == 0x18) /* counter packet (short) */
> > > + return arm_spe_get_counter(buf, len, 0, packet);
> > > + } else
> > > + if (byte == 0x71)
> > > + return arm_spe_get_timestamp(buf, len, packet);
> > > + else if ((byte & 0xf) == 0x2)
> > > + return arm_spe_get_events(buf, len, packet);
> > > + else if ((byte & 0xf) == 0x3)
> > > + return arm_spe_get_data_source(buf, packet);
> > > + else if ((byte & 0x3c) == 0x24)
> > > + return arm_spe_get_context(buf, len, packet);
> > > + else if ((byte & 0x3c) == 0x8)
> > > + return arm_spe_get_insn_type(buf, packet);
> >
> > Could we have some mnemonics for these?
> >
> > e.g.
> >
> > #define SPE_HEADER0_PAD 0x0
> > #define SPE_HEADER0_END 0x1
> >
> > #define SPE_HEADER0_EVENTS 0x42
> > #define SPE_HEADER0_EVENTS_MASK 0xcf
> >
> > if (byte == SPE_HEADER0_PAD) {
> > ...
> > } else if (byte == SPE_HEADER0_END) {
> > ...
> > } else if ((byte & SPE_HEADER0_EVENTS_MASK) == SPE_HEADER0_EVENTS) {
> > ...
> > }
> >
> > ... which could even be turned into something table-driven.
>
> It'd be a pretty sparse table, so I doubt it'd be faster, but if it is,
> I'd just as soon leave that type of space tradeoff decision to the
> compiler, given its optimization directives.
>
> I'll take a look at replacing the constants that have named equivalents
> with their named versions, even though it was pretty clear already what
> they denoted, given the name of the function each branch was calling,
> and the comments.
>
> > > + } else if ((byte & 0xe0) == 0x20 /* 0y00100000 */) {
> > > + /* 16-bit header */
> > > + byte = buf[1];
> > > + if (byte == 0)
> > > + return arm_spe_get_alignment(buf, len, packet);
> > > + else if ((byte & 0xf8) == 0xb0)
> > > + return arm_spe_get_addr(buf, len, 1, packet);
> > > + else if ((byte & 0xf8) == 0x98)
> > > + return arm_spe_get_counter(buf, len, 1, packet);
> > > + }
> > > +
> > > + return ARM_SPE_BAD_PACKET;
> > > +}
> > > +
> > > +int arm_spe_get_packet(const unsigned char *buf, size_t len,
> > > + struct arm_spe_pkt *packet)
> > > +{
> > > + int ret;
> > > +
> > > + ret = arm_spe_do_get_packet(buf, len, packet);
> > > + if (ret > 0) {
> > > + while (ret < 1 && len > (size_t)ret && !buf[ret])
> > > + ret += 1;
> > > + }
> >
> > What is this trying to do?
>
> Nothing! I've since fixed it to prevent multiple contiguous PADs
> from coming out on their own lines, instead accumulating up to 16
> (the width of the raw dump format) on one PAD-labeled line, like so:
>
> . 00007ec9: 00 00 00 00 00 00 00 00 00 00 PAD
>
> instead of this:
>
> . 00007ec9: 00 PAD
> . 00007eca: 00 PAD
> . 00007ecb: 00 PAD
> . 00007ecc: 00 PAD
> . 00007ecd: 00 PAD
> . 00007ece: 00 PAD
> . 00007ecf: 00 PAD
> . 00007ed0: 00 PAD
> . 00007ed1: 00 PAD
> . 00007ed2: 00 PAD
>
> thanks for pointing it out.
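The accumulation itself is a small loop; roughly (a sketch of the fixed logic):

```c
#include <stddef.h>

/* After decoding a PAD packet, swallow up to 16 contiguous 0x00
 * (PAD) bytes -- 16 being the width of the raw dump format -- so
 * they are reported as one packet instead of one per byte. */
static int arm_spe_consume_pads(const unsigned char *buf, size_t len,
				int ret)
{
	while (ret > 0 && ret < 16 && len > (size_t)ret && !buf[ret])
		ret += 1;
	return ret;
}
```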
>
> > > +int arm_spe_pkt_desc(const struct arm_spe_pkt *packet, char *buf,
> > > + size_t buf_len)
> > > +{
> > > + int ret, ns, el, index = packet->index;
> > > + unsigned long long payload = packet->payload;
> > > + const char *name = arm_spe_pkt_name(packet->type);
> > > +
> > > + switch (packet->type) {
> > > + case ARM_SPE_BAD:
> > > + case ARM_SPE_PAD:
> > > + case ARM_SPE_END:
> > > + return snprintf(buf, buf_len, "%s", name);
> > > + case ARM_SPE_EVENTS: {
> >
> > [...]
> >
> > > + case ARM_SPE_DATA_SOURCE:
> > > + case ARM_SPE_TIMESTAMP:
> > > + return snprintf(buf, buf_len, "%s %lld", name, payload);
> > > + case ARM_SPE_ADDRESS:
> > > + switch (index) {
> > > + case 0:
> > > + case 1: ns = !!(packet->payload & NS_FLAG);
> > > + el = (packet->payload & EL_FLAG) >> 61;
> > > + payload &= ~(0xffULL << 56);
> > > + return snprintf(buf, buf_len, "%s %llx el%d ns=%d",
> > > + (index == 1) ? "TGT" : "PC", payload, el, ns);
> >
> > For TTBR1 addresses, this ends up losing the leading 0xff, giving us
> > invalid addresses, which look odd.
> >
> > Can we please sign-extend bit 55 so that this gives us valid addresses
> > regardless of TTBR?
>
> I'll take a look at doing this once I get consistent output from an
> implementation.
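For reference, the sign extension itself should be straightforward once the output settles down; something like (a sketch):

```c
#include <stdint.h>

/* SPE address packets carry a 56-bit VA plus tag bits in the top
 * byte; sign-extending bit 55 reconstructs canonical addresses,
 * restoring the leading 0xff for TTBR1 (kernel) addresses. */
static uint64_t arm_spe_extend_addr(uint64_t payload)
{
	uint64_t addr = payload & ((1ULL << 56) - 1);	/* drop tag byte */

	if (addr & (1ULL << 55))
		addr |= ~((1ULL << 56) - 1);		/* sign-extend bit 55 */
	return addr;
}
```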
>
> > Could we please add a '0x' prefix to hex numbers, and use 0x%016llx so
> > that things get padded consistently?
>
> I've added the 0x prefix, but prefer not to fix the width at 016: I
> don't see any direct benefit, and see some benefit in letting the
> length vary, both for output size control and for less obvious
> reasons, e.g., sorting address lines by their length to get a sense of
> the address groups caught during the run. FWIW, Intel doesn't use 016
> either.
>
> If I've omitted a response to the other comments, it's because they are
> being addressed.

Hi Mark, I've tried to proceed as much as possible without your
response, so if you still have responses to my comments above, please
reply in-line; otherwise, please review the v2 patch below.

Thanks,

Kim