Re: [PATCH v1 2/5] KVM: arm/arm64: Adjust entry/exit and trap related tracepoints

From: James Morse
Date: Mon Jun 17 2019 - 07:24:30 EST

Hi Zenghui,

On 13/06/2019 12:28, Zenghui Yu wrote:
> On 2019/6/12 20:49, James Morse wrote:
>> On 12/06/2019 10:08, Zenghui Yu wrote:
>>> Currently, we use trace_kvm_exit() to report exception type (e.g.,
>>> "IRQ", "TRAP") and exception class (ESR_ELx's bit[31:26]) together.
>>> But hardware only saves the exit class to ESR_ELx on synchronous
>>> exceptions, not on asynchronous exceptions. When the guest exits
>>> due to external interrupts, we will get tracing output like:
>>> ÂÂÂÂ"kvm_exit: IRQ: HSR_EC: 0x0000 (UNKNOWN), PC: 0xffff87259e30"
>>> Obviously, "HSR_EC" here is meaningless.

>> I assume we do it this way so there is only one guest-exit tracepoint that catches all
>> exits.
>> I don't think its a problem if user-space has to know the EC isn't set for asynchronous
>> exceptions, this is a property of the architecture and anything using these trace-points
>> is already arch specific.

> Actually, *no* problem in current implementation, and I'm OK to still
> keep the EC in trace_kvm_exit(). What I really want to do is adding the
> EC in trace_trap_enter (the new tracepoint), will explain it later.

>>> This patch splits "exit" and "trap" events by adding two tracepoints
>>> explicitly in handle_trap_exceptions(). Let trace_kvm_exit() report VM
>>> exit events, and trace_kvm_trap_exit() report VM trap events.
>>> These tracepoints are adjusted also in preparation for supporting
>>> 'perf kvm stat' on arm64.
>> Because the existing tracepoints are ABI, I don't think we can change them.
>> We can add new ones if there is something that a user reasonably needs to trace, and can't
>> be done any other way.
>> What can't 'perf kvm stat' do with the existing trace points?

> First, how does 'perf kvm stat' interact with tracepoints?

Start at the beginning, good idea. (I've never used this thing!)

> We have three handlers for a specific event (e.g., "VM-EXIT") --
> "is_begin_event", "is_end_event", "decode_key". The first two handlers
> make use of two existing tracepoints ("kvm:kvm_exit" & "kvm:kvm_entry")
> to check when the VM-EXIT events started/ended, thus the time difference
> stats, event start/end time etc. can be calculated.

> "is_begin_event" handler gets a *key* from the "ret" field (exit_code)
> of "kvm:kvm_exit" payload, and "decode_key" handler makes use of the
> *key* to find out the reason for the VM-EXIT event. Of course we should
> maintain the mapping between exit_code and exit_reason in userspace.

Interpreting 'ret' is going to get tricky if we change those values on a whim. Its
internal to the KVM arch code.

> These are all what *patch #4* had done, #4 is a simple patch to review!

> Oh, we can also set "vcpu_id_str" to achieve per vcpu event record, but
> currently, we only have the "vcpu_pc" field in "kvm:kvm_entry", without
> something like "vcpu_id".

Heh, so from the trace-point data, you can't know which on is vcpu-0 and which is vcpu-1.

> OK, next comes the more important question - what should/can we do to
> the tracepoints in preparation of 'perf kvm stat' on arm64?
> From the article you've provided, it's clear that we can't remove the EC
> from trace_kvm_exit(). But can we add something like "vcpu_id" into
> (at least) trace_kvm_entry(), just like what this patch has done?

Adding something is still likely to break a badly written user-space that is trying to
parse the trace information. A regex picking out the last argument will now get a
different value.

> If not, which means we have to keep the existing tracepoints totally
> unchanged, then 'perf kvm stat' will have no way to record/report per
> vcpu VM-EXIT events (other arch like X86, powerpc, s390 etc. have this
> capability, if I understand it correctly).

Well, you get the events, but you don't know which vCPU is which. You can map this back to
the pid of the host thread assuming user-space isn't moving vcpu between host threads.

If we're really stuck: Adding tracepoints to KVM-core's vcpu get/put, that export the
vcpu_id would let you map pid->vcpu_id, which you can then use for the batch of enter/exit
events that come before a final vcpu put.
grepping "vpu_id" shows perf has a mapping for which arch-specific argument in enter/exit
is the vcpu-id. Done with this core-code mapping, you could drop that code...

But I'd be a little nervous adding a new trace-point to work around an ABI problem, as we
may have just moved the ABI problem! (What does a user of a vcpu_put tracepoint really need?)

> As for TRAP events, should we consider adding two new tracepoints --
> "kvm_trap_enter" and "kvm_trap_exit", to keep tracking of the trap
> handling process? We should also record the EC in "kvm_trap_enter", which will be used as
> *key* in TRAP event's "is_begin_event" handler.

The EC can't change between trace_kvm_exit() and handle_exit(), so you already have this.

What are the 'trap' trace points needed for? You get the timing and 'exception class' from
the guest enter/exit tracepoints. What about handle_exit() can't you work out from this?

> Patch #5 tells us the whole story, it's simple too.

(I only skimmed the perf patches, I'll go back now that I know a little more about what
you're doing)

> What do you suggest?

We can explore the vcpu_load()/vcpu_put() trace idea, (it may not work for some other
reason). I'd like to understand what the 'trap' tracepoints are needed for.