Re: [PATCH v1 2/2] perf/core: Fake regs for leaked kernel samples

From: Jin, Yao
Date: Fri Aug 07 2020 - 01:23:15 EST

Hi Peter,

On 8/6/2020 5:18 PM, peterz@xxxxxxxxxxxxx wrote:
On Thu, Aug 06, 2020 at 10:26:29AM +0800, Jin, Yao wrote:

+static struct pt_regs *sanitize_sample_regs(struct perf_event *event, struct pt_regs *regs)
+{
+	struct pt_regs *sample_regs = regs;
+	/* user only */
+	if (!event->attr.exclude_kernel || !event->attr.exclude_hv ||
+	    !event->attr.exclude_host || !event->attr.exclude_guest)
+		return sample_regs;

Is this condition correct?

Say we are counting a user event on the host, with exclude_kernel = 1 and
exclude_host = 0. It will take the "return sample_regs" path.

I'm not sure, I'm terminally confused on virt stuff.

Suppose we have nested virt:


And we're running in G0, then:

- 'exclude_hv' would exclude L0 events
- 'exclude_host' would ... exclude L1-hv events?

I think exclude_host is generally set by the guest (arch/x86/kvm/pmu.c, pmc_reprogram_counter()).

If G0 is the host and we set exclude_host in G0, I think we will not be able to count events on G0.

The appropriate usage is for G1 to set exclude_host, so that events on G0 are not collected by guest G1.

That's my understanding of how exclude_host is used.

- 'exclude_guest' would ... exclude G1 events?

Similarly, the appropriate usage is for the host (G0) to set exclude_guest, so that events on G1 are not collected by host G0.

If G1 sets exclude_guest, it has no effect, since no guest runs under G1.

Then the next question is, if G0 is a host, does the L1-hv run in
G0 userspace or G0 kernel space?

I'm not sure. Maybe part of it runs in the kernel and part in userspace (qemu)? Perhaps a KVM expert can help answer this question.

I was assuming G0 userspace would not include anything L1 (kvm is a
kernel module after all), but what do I know.

I tested the following condition in a native environment (not in a KVM guest), and the result is not what we expect.

	/* user only */
	if (!event->attr.exclude_kernel || !event->attr.exclude_hv ||
	    !event->attr.exclude_host || !event->attr.exclude_guest)
		return sample_regs;

perf record -e cycles:u ./div
perf report --stdio

# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  .......................
    49.51%  div                        [.] __random_r
    33.93%  div                        [.] __random
     8.13%  div                        [.] rand
     4.29%  div      div               [.] main
     4.14%  div      div               [.] rand@plt
     0.00%  div      [unknown]         [k] 0xffffffffbd600cb0
     0.00%  div      [unknown]         [k] 0xffffffffbd600df0
     0.00%  div                        [.] _dl_relocate_object
     0.00%  div                        [.] _dl_start
     0.00%  div                        [.] _start

0xffffffffbd600cb0 and 0xffffffffbd600df0 are leaked kernel addresses.

From the debug output, I can see:

[ 6272.320258] jinyao: sanitize_sample_regs: event->attr.exclude_kernel = 1, event->attr.exclude_hv = 1, event->attr.exclude_host = 0, event->attr.exclude_guest = 0

So it takes the "return sample_regs;" path.
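
To make the failure easier to see, here is a minimal userspace sketch of the same boolean test. It is only a model: struct excl_bits, takes_early_return() and the other names are hypothetical stand-ins, not kernel code, and the values are copied from the printk line above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the four perf_event_attr bits the check reads. */
struct excl_bits {
	bool exclude_kernel;
	bool exclude_hv;
	bool exclude_host;
	bool exclude_guest;
};

/*
 * Mirrors the quoted "user only" test. A true result means the function
 * takes the early "return sample_regs;" path and leaves the regs untouched.
 */
bool takes_early_return(const struct excl_bits *a)
{
	return !a->exclude_kernel || !a->exclude_hv ||
	       !a->exclude_host || !a->exclude_guest;
}

/* Values from the printk output for "perf record -e cycles:u ./div". */
const struct excl_bits native_cycles_u = {
	.exclude_kernel = true,
	.exclude_hv = true,
	.exclude_host = false,
	.exclude_guest = false,
};

/* Self-check: the native cycles:u event is not treated as user-only. */
void check_native_cycles_u(void)
{
	assert(takes_early_return(&native_cycles_u));
}
```

With exclude_host = 0 and exclude_guest = 0, the expression is true even though exclude_kernel = 1, so a plain cycles:u event on bare metal is never sanitized by this check.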

@@ -11609,7 +11636,8 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (err)
 		return err;
-	if (!attr.exclude_kernel) {
+	if (!attr.exclude_kernel || !attr.exclude_callchain_kernel ||
+	    !attr.exclude_hv || !attr.exclude_host || !attr.exclude_guest) {
 		err = perf_allow_kernel(&attr);
 		if (err)
 			return err;

I can understand the conditions "!attr.exclude_kernel || !attr.exclude_callchain_kernel".

But I'm not very sure about the "!attr.exclude_hv || !attr.exclude_host || !attr.exclude_guest".

Well, I'm very sure G0 userspace should never see L0 or G1 state, so
exclude_hv and exclude_guest had better be true.

On host, exclude_hv = 1, exclude_guest = 1 and exclude_host = 0, right?

Same as above, is G0 host state G0 userspace?

So even with exclude_kernel = 1, if exclude_host = 0 we will still take the
perf_allow_kernel() path. Please correct me if my understanding is wrong.

Yes, because with those permission checks in place it means you have
permission to see kernel bits.

I also added a printk at the syscall entry.

Aug 7 03:37:40 kbl-ppc kernel: [ 854.688045] syscall: attr.exclude_kernel = 1, attr.exclude_callchain_kernel = 0, attr.exclude_hv = 0, attr.exclude_host = 0, attr.exclude_guest = 0

For my test case ("perf record -e cycles:u ./div"), perf_allow_kernel() is also executed.
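
As a rough illustration, the old and new conditions from the hunk can be compared in a small userspace model. The struct and function names below are hypothetical and only mimic the boolean logic; the attribute values are the ones from the syscall printk above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the perf_event_attr bits used by the check. */
struct open_attr_bits {
	bool exclude_kernel;
	bool exclude_callchain_kernel;
	bool exclude_hv;
	bool exclude_host;
	bool exclude_guest;
};

/* Condition before the patch: only exclude_kernel is consulted. */
bool old_check_calls_allow_kernel(const struct open_attr_bits *a)
{
	return !a->exclude_kernel;
}

/* Condition from the quoted hunk: any cleared exclude bit triggers the check. */
bool new_check_calls_allow_kernel(const struct open_attr_bits *a)
{
	return !a->exclude_kernel || !a->exclude_callchain_kernel ||
	       !a->exclude_hv || !a->exclude_host || !a->exclude_guest;
}

/* Values from the syscall-entry printk for "perf record -e cycles:u ./div". */
const struct open_attr_bits syscall_cycles_u = {
	.exclude_kernel = true,
	.exclude_callchain_kernel = false,
	.exclude_hv = false,
	.exclude_host = false,
	.exclude_guest = false,
};

/* Self-check: the old condition skips perf_allow_kernel(), the new one does not. */
void check_syscall_cycles_u(void)
{
	assert(!old_check_calls_allow_kernel(&syscall_cycles_u));
	assert(new_check_calls_allow_kernel(&syscall_cycles_u));
}
```

In this model the old condition skips perf_allow_kernel() for cycles:u, while the new one calls it, which matches what I see in the log.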

Jin Yao