Re: [PATCH V3 0/5] bugs fix for large PEBS mmap read and rdpmc read

From: Liang, Kan
Date: Tue Jan 30 2018 - 09:41:47 EST




On 1/30/2018 4:16 AM, Stephane Eranian wrote:
Hi,

On Mon, Jan 29, 2018 at 8:29 AM, <kan.liang@xxxxxxxxxxxxxxx> wrote:

From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>

------

Changes since V2:
- Refined the changelog
- Introduced specific read function for large PEBS.
The previous generic PEBS read function is confusing.
Disabled PMU in pmu::read() path for large PEBS.
Handled the corner case when reload_times == 0.
- Modified the parameter of intel_pmu_save_and_restart_reload()
Discarded local64_cmpxchg
- Added fixes tag
- Added WARN to handle reload_times == 0 || reload_val == 0

Changes since V1:
- Check PERF_X86_EVENT_AUTO_RELOAD before call
intel_pmu_save_and_restore()

It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed
with large PEBS. Large PEBS requires fixed period. So the kernel could
make up the period from the event and store it in the sampling buffer.

The PERF_SAMPLE_PERIOD will be implicitly set in freq mode.
By now, large PEBS doesn't support freq mode yet.


I tried using large PEBS recently, and despite trying different option
combination of perf record, I was not able to get it to work.

$ perf record -c 1 -e cpu/event=0xd1,umask=0x10,period=19936/pp
--no-timestamp --no-period -a -C 0

But I was able to make this work with a much older kernel.

Is there any error message?


Another annoyance I ran into is with perf record requiring -c period
in order not to set
PERF_SAMPLE_PERIOD in the event.

If I do:
perf record -c 1 -e cpu/event=0xd1,umask=0x10,period=19936/pp
--no-timestamp --no-period -a -C 0

I get

perf_event_attr:
type 4
size 112
config 0x10d1
{ sample_period, sample_freq } 199936
sample_type IP|TID|CPU

But if I do:
perf record -e cpu/event=0xd1,umask=0x10,period=19936/pp
--no-timestamp --no-period -a -C 0

I get

perf_event_attr:
type 4
size 112
config 0x10d1
{ sample_period, sample_freq } 199936
sample_type IP|TID|CPU|PERIOD

Perf should check if all events have a period=, then it should not
pass PERF_SAMPLE_PERIOD, even
more so when only one event is defined.

As my understanding, the "period=" is per-event term. It should not change the global setting.
I think it's better still use "--no-period" to control the PERIOD bit.


Also it does not seem to honor --no-period.


Right, perf tool doesn't clear the PERIOD for --no-period.
if (opts->period)
perf_evsel__set_sample_bit(evsel, PERIOD);

I will submit a patch to fix it.

All of this makes it hard to use large PEBS in general.

So I think your series should also address this part.

The issue looks like perf tool problem. I will take a look at it, but probably address it in a separate patch set.

Thanks,
Kan

- Introduce a special purpose intel_pmu_save_and_restart()
just for AUTO_RELOAD.
- New patch to disable userspace RDPMC usage if large PEBS is enabled.

------

There is a bug when mmap read event->count with large PEBS enabled.
Here is an example.
#./read_count
0x71f0
0x122c0
0x1000000001c54
0x100000001257d
0x200000000bdc5

The bug is caused by two issues.
- In x86_perf_event_update, the calculation of event->count does not
take the auto-reload values into account.
- In x86_pmu_read, it doesn't count the undrained values in large PEBS
buffers.

The first issue was introduced with the auto-reload mechanism enabled
since commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload
mechanism when possible")

Patch 1 fixed the issue in x86_perf_event_update.

The second issue was introduced since commit b8241d20699e
("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
interrupt threshold)")

Patch 2-4 fixed the issue in x86_pmu_read.

Besides the two issues, the userspace RDPMC usage is broken for large
PEBS as well.
The RDPMC issue was also introduced since commit b8241d20699e
("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
interrupt threshold)")

Patch 5 fixed the RDPMC issue.

The source code of read_count is as below.

struct cpu {
int fd;
struct perf_event_mmap_page *buf;
};

int perf_open(struct cpu *ctx, int cpu)
{
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
.size = sizeof(struct perf_event_attr),
.sample_period = 100000,
.config = 0,
.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
PERF_SAMPLE_TIME | PERF_SAMPLE_CPU,
.precise_ip = 3,
.mmap = 1,
.comm = 1,
.task = 1,
.mmap2 = 1,
.sample_id_all = 1,
.comm_exec = 1,
};
ctx->buf = NULL;
ctx->fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
if (ctx->fd < 0) {
perror("perf_event_open");
return -1;
}
return 0;
}

void perf_close(struct cpu *ctx)
{
close(ctx->fd);
if (ctx->buf)
munmap(ctx->buf, pagesize);
}

int main(int ac, char **av)
{
struct cpu ctx;
u64 count;

perf_open(&ctx, 0);

while (1) {
sleep(5);

if (read(ctx.fd, &count, 8) != 8) {
perror("counter read");
break;
}
printf("0x%llx\n", count);

}
perf_close(&ctx);
}

Kan Liang (5):
perf/x86/intel: fix event update for auto-reload
perf/x86: introduce read function for x86_pmu
perf/x86/intel/ds: introduce read function for large pebs
perf/x86/intel: introduce read function for intel_pmu
perf/x86: fix: disable userspace RDPMC usage for large PEBS

arch/x86/events/core.c | 5 ++-
arch/x86/events/intel/core.c | 9 +++++
arch/x86/events/intel/ds.c | 85 ++++++++++++++++++++++++++++++++++++++++++--
arch/x86/events/perf_event.h | 3 ++
4 files changed, 99 insertions(+), 3 deletions(-)

--
2.7.4