[RFC PATCH] perf/core: don't sample kernel regs upon skid

From: Mark Rutland
Date: Mon Jul 02 2018 - 11:13:00 EST

Users can request that general purpose registers, instruction pointer,
etc, are sampled when a perf event counter overflows. To try to avoid
this resulting in kernel state being leaked, unprivileged users are
usually forbidden from opening events which count while the kernel is

Unfortunately, this is not sufficient to avoid leading kernel state.

For various reasons, there can be a delay between the overflow occurring
and the resulting overflow exception (e.g. an NMI) being taken. During
this window, other instructions may be executed, resulting in skid.

This skid means that a userspace-only event overflowing may result in an
exception being taken *after* entry to the kernel, allowing kernel
registers to be sampled. Depending on the amount of skid, this may only
leak the PC (breaking KASLR), or it may leak secrets which are currently
live in GPRs.

Let's avoid this by only sampling from the user registers when an event
is supposed to exclude the kernel, providing the illusion that the
overflow exception is taken from userspace.

We also have similar cases when sampling a guest, where we get the host
regs in some cases. It's not entirely clear to me how we should handle

Signed-off-by: Mark Rutland <mark.rutland@xxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Jin Yao <yao.jin@xxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
kernel/events/core.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8f0434a9951a..2ab2548b2e66 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6361,6 +6361,32 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
return callchain ?: &__empty_callchain;

+static struct pt_regs *perf_get_sample_regs(struct perf_event *event,
+ struct pt_regs *regs)
+ /*
+ * Due to interrupt latency (AKA "skid"), we may enter the kernel
+ * before taking an overflow, even if the PMU is only counting user
+ * events.
+ *
+ * If we're not counting kernel events, always use the user regs when
+ * sampling.
+ *
+ * TODO: what do we do about sampling a guest's registers? The IP is
+ * special-cased, but for the rest of the regs they'll get the
+ * user/kernel regs depending on whether exclude_kernel is set, which
+ * is nonsensical.
+ *
+ * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU
+ * can't provide the GPRs), so should we just zero the GPRs when in a
+ * guest? Or skip outputting the regs in perf_output_sample?
+ */
+ if (event->attr.exclude_kernel && !user_mode(regs))
+ return task_pt_regs(current);
+ return regs;
void perf_prepare_sample(struct perf_event_header *header,
struct perf_sample_data *data,
struct perf_event *event,
@@ -6368,6 +6394,8 @@ void perf_prepare_sample(struct perf_event_header *header,
u64 sample_type = event->attr.sample_type;

+ regs = perf_get_sample_regs(event, regs);
header->type = PERF_RECORD_SAMPLE;
header->size = sizeof(*header) + event->header_size;