Re: [PATCH v1 2/5] KVM: arm/arm64: Adjust entry/exit and trap related tracepoints

From: Zenghui Yu
Date: Fri Jun 21 2019 - 09:27:59 EST

Hi James,

sorry for the late reply.

On 2019/6/17 19:19, James Morse wrote:
Hi Zenghui,

On 13/06/2019 12:28, Zenghui Yu wrote:
On 2019/6/12 20:49, James Morse wrote:
On 12/06/2019 10:08, Zenghui Yu wrote:
Currently, we use trace_kvm_exit() to report exception type (e.g.,
"IRQ", "TRAP") and exception class (ESR_ELx's bit[31:26]) together.
But hardware only saves the exit class to ESR_ELx on synchronous
exceptions, not on asynchronous exceptions. When the guest exits
due to external interrupts, we will get tracing output like:

ÂÂÂÂ"kvm_exit: IRQ: HSR_EC: 0x0000 (UNKNOWN), PC: 0xffff87259e30"

Obviously, "HSR_EC" here is meaningless.

I assume we do it this way so there is only one guest-exit tracepoint that catches all
I don't think its a problem if user-space has to know the EC isn't set for asynchronous
exceptions, this is a property of the architecture and anything using these trace-points
is already arch specific.

Actually, *no* problem in current implementation, and I'm OK to still
keep the EC in trace_kvm_exit(). What I really want to do is adding the
EC in trace_trap_enter (the new tracepoint), will explain it later.

This patch splits "exit" and "trap" events by adding two tracepoints
explicitly in handle_trap_exceptions(). Let trace_kvm_exit() report VM
exit events, and trace_kvm_trap_exit() report VM trap events.

These tracepoints are adjusted also in preparation for supporting
'perf kvm stat' on arm64.

Because the existing tracepoints are ABI, I don't think we can change them.

We can add new ones if there is something that a user reasonably needs to trace, and can't
be done any other way.

What can't 'perf kvm stat' do with the existing trace points?

First, how does 'perf kvm stat' interact with tracepoints?

Start at the beginning, good idea. (I've never used this thing!)

We have three handlers for a specific event (e.g., "VM-EXIT") --
"is_begin_event", "is_end_event", "decode_key". The first two handlers
make use of two existing tracepoints ("kvm:kvm_exit" & "kvm:kvm_entry")
to check when the VM-EXIT events started/ended, thus the time difference
stats, event start/end time etc. can be calculated.

"is_begin_event" handler gets a *key* from the "ret" field (exit_code)
of "kvm:kvm_exit" payload, and "decode_key" handler makes use of the
*key* to find out the reason for the VM-EXIT event. Of course we should
maintain the mapping between exit_code and exit_reason in userspace.

Interpreting 'ret' is going to get tricky if we change those values on a whim. Its
internal to the KVM arch code.

Yes. We have to keep the definition of 'ret' and the mapping
(kvm_arm_exception_type) consistent between user side and kernel side.
kernel side: arch/arm64/include/asm/kvm_asm.h
user side: tools/perf/arch/arm64/util/aarch64_guest_exits.h (patch #4)

Do we have a better approach?

These are all what *patch #4* had done, #4 is a simple patch to review!

Oh, we can also set "vcpu_id_str" to achieve per vcpu event record, but
currently, we only have the "vcpu_pc" field in "kvm:kvm_entry", without
something like "vcpu_id".

Heh, so from the trace-point data, you can't know which on is vcpu-0 and which is vcpu-1.


OK, next comes the more important question - what should/can we do to
the tracepoints in preparation of 'perf kvm stat' on arm64?

From the article you've provided, it's clear that we can't remove the EC
from trace_kvm_exit(). But can we add something like "vcpu_id" into
(at least) trace_kvm_entry(), just like what this patch has done?

Adding something is still likely to break a badly written user-space that is trying to
parse the trace information. A regex picking out the last argument will now get a
different value.

If not, which means we have to keep the existing tracepoints totally
unchanged, then 'perf kvm stat' will have no way to record/report per
vcpu VM-EXIT events (other arch like X86, powerpc, s390 etc. have this
capability, if I understand it correctly).

Well, you get the events, but you don't know which vCPU is which. You can map this back to
the pid of the host thread assuming user-space isn't moving vcpu between host threads.

If we're really stuck: Adding tracepoints to KVM-core's vcpu get/put, that export the
vcpu_id would let you map pid->vcpu_id, which you can then use for the batch of enter/exit
events that come before a final vcpu put.
grepping "vpu_id" shows perf has a mapping for which arch-specific argument in enter/exit
is the vcpu-id. Done with this core-code mapping, you could drop that code...

Yes. In the current 'perf kvm' implementation, we make use of the
"vcpu id" field (specified by vcpu_id_str, differ between arch) of
"kvm:kvm_entry" tracepoint payload, to achieve per vcpu record/report.
(Details in per_vcpu_record() in tools/perf/builtin-kvm.c)

Adding tracepoints in vcpu_load()/vcpu_put() might be a good idea, we
can get "vcpu id" information without breaking existing tracepoint ABI.
And I think other arch will benefit from this way, for those who haven't
supported 'perf kvm stat' yet and don't have "vcpu id" info in the
"kvm:kvm_entry" tracepoint.

But there is at least one problem (as I can see)... KVM will not always
do a vcpu_load()/vcpu_put(), while guest always enters/exits.
In the worst case, if KVM can always handle guest's exits, perf will not
have a chance to catch vcpu_load()/vcpu_put(). And still, we fail to get
per vcpu statistical analysis.

I've written a simple patch to prove this, attached at the end of this
mail. I run 'perf kvm stat record/report' against a 4-VCPU guest, but
only vcpu-3 will be reported, as only vcpu-3 is doing some MMIO access
and we have to return to user space (and vcpu_load() will be invoked).

I'm not sure if I (likely) missed some important points you've provided.

But I'd be a little nervous adding a new trace-point to work around an ABI problem, as we
may have just moved the ABI problem! (What does a user of a vcpu_put tracepoint really need?)

As for TRAP events, should we consider adding two new tracepoints --
"kvm_trap_enter" and "kvm_trap_exit", to keep tracking of the trap
handling process? We should also record the EC in "kvm_trap_enter", which will be used as
*key* in TRAP event's "is_begin_event" handler.

The EC can't change between trace_kvm_exit() and handle_exit(), so you already have this.

What are the 'trap' trace points needed for? You get the timing and 'exception class' from
the guest enter/exit tracepoints. What about handle_exit() can't you work out from this?

In short, these two trap tracepoints are for patch #5.
Some users (me) may want to know, how many traps a guest has in a given
period of time, how many times a particular type of trap (e.g., Dabort)
has occurred, and how long each trap has been processed. That's what
patch #5 does, you can get these info through 'perf kvm stat report

If we use guest enter/exit tracepoints to support TRAP events, there
will be two problems:
1) the timing becomes inaccurate (longer than the real timing)
2) the reported 'EC' becomes "Unknown" in the case when guest exits
due to IRQ, as we set 'esr_ec' to 0 on asynchronous exceptions,
in trace_kvm_exit.

Patch #5 tells us the whole story, it's simple too.

(I only skimmed the perf patches, I'll go back now that I know a little more about what
you're doing)

Much appreciated :)

What do you suggest?

We can explore the vcpu_load()/vcpu_put() trace idea, (it may not work for some other
reason). I'd like to understand what the 'trap' tracepoints are needed for.

Thanks for your suggestion!

Btw, I have collected some useful links, attached here, to share with
someone who're interested in this series.

[1] the initial idea for 'perf kvm stat':
[2] a discussion on improving kvm_exit tracepoint:



Two tracepoints have already been added in vcpu_load/put().

tools/perf/arch/arm64/util/kvm-stat.c | 2 ++
tools/perf/builtin-kvm.c | 7 ++++++-
2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/perf/arch/arm64/util/kvm-stat.c b/tools/perf/arch/arm64/util/kvm-stat.c
index 41b28de..a91fe7d 100644
--- a/tools/perf/arch/arm64/util/kvm-stat.c
+++ b/tools/perf/arch/arm64/util/kvm-stat.c
@@ -103,6 +103,8 @@ static void trap_event_decode_key(struct perf_kvm_stat *kvm __maybe_unused,
+ "kvm:vcpu_load",
+ "kvm:vcpu_put",

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index dbb6f73..f011fcf 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -81,6 +81,11 @@ bool exit_event_begin(struct perf_evsel *evsel,
return false;

+static bool kvm_vcpu_load_event(struct perf_evsel *evsel)
+ return !strcmp(evsel->name, "kvm:vcpu_load");
bool kvm_entry_event(struct perf_evsel *evsel)
return !strcmp(evsel->name, kvm_entry_trace);
@@ -400,7 +405,7 @@ struct vcpu_event_record *per_vcpu_record(struct thread *thread,
struct perf_sample *sample)
/* Only kvm_entry records vcpu id. */
- if (!thread__priv(thread) && kvm_entry_event(evsel)) {
+ if (!thread__priv(thread) && kvm_vcpu_load_event(evsel)) {
struct vcpu_event_record *vcpu_record;

vcpu_record = zalloc(sizeof(*vcpu_record));