> I agree that we need to keep things simple. First, the guest already
> knows how to deal with per-cpu performance monitors, since that's how
> most (all) hardware works. So we aren't making the guest more complex,
> and on the other hand we simplify the host.

This design is to deal with task context perf collection in the guest os.

Scenario 1:
1) the guest os starts to collect statistics of process A on vcpu 0;
2) process A is scheduled to vcpu 1. Then the perf_event at the host side
needs to be moved to vcpu 1's thread. With the per-KVM-instance design, we
needn't move host_perf_shadow among vcpus.
> Second, if process A is migrated, and the guest uses per-process
> counters, the guest will need to stop/start the counter during the
> migration.

Agree. My patches do so.

> This will cause the host to migrate the counter,

Question: where does the host migrate the counter to? The perf event at
the host side is bound to a specific vcpu thread.

> so while we didn't move the counter to a different vcpu, we still have
> to move it to a different cpu.

Disagree here. If process A on vcpu 0 in the guest os is migrated to
vcpu 1, the host has to move process A's perf_event to the vcpu 1 thread.
>> Scenario 2:
>> 1) the guest os creates a perf_event at the host side on vcpu 0;
>> 2) a malicious guest os calls close to delete the host perf_event on
>> vcpu 1, but enables the perf_event on vcpu 0 at the same time. When
>> the close thread runs to get the host_perf_shadow from the list, the
>> enable thread also gets it. Then the close thread deletes the
>> perf_event, and the enable thread will cause a host kernel panic when
>> using host_perf_shadow.
>
> With per-vcpu events, this can't happen. Each vcpu has its own set of
> perf events in their own ID namespace. vcpu 0 can't touch vcpu 1's
> events even if it knows their IDs.

What does 'touch' mean here?

With the task (process) context event in the guest os, we have to migrate
the event among vcpus when the guest process scheduler balances processes
among vcpus. So all events in a guest os instance use the same ID
namespace. What you mentioned is really right when the event is a cpu
context event, but not with a task context event.
>> We use mutex_trylock in the NMI handler. If it can't get the lock, an
>> NMI miss is happening, but the host kernel still updates
>> perf_event->host_perf_shadow.counter, so the overflow data will be
>> accumulated.
>
> I see. I don't think this is needed if we disable the counters during
> guest->host switch; we can just copy the data and set a bit in
> vcpu->requests so that we can update the guest during the next entry.

We don't disable counters during guest->host switch. The generic perf
code disables them when:
1) the host kernel process scheduler schedules the task out, if the event
is a task context event;
2) the guest os calls the DISABLE hypercall.

We could use a bit of vcpu->requests, but that is not much different from
simply checking (vcpu->arch.overflows == 0) in function
kvm_sync_events_to_guest. The key is how to save the pending perf_event
pointers so kvm can update their data to the guest;
vcpu->arch.overflow_events does that.
> The guest NMI handlers and callbacks are all serialized by the guest
> itself.

In the guest, NMIs and callbacks are serialized on a specific perf event,
but perf_close isn't. The generic perf code handles this carefully. In a
guest/host environment, we need a new lock to coordinate it. In addition,
please consider a malicious guest os that might not serialize the
operations, in order to cause a host kernel panic.

> This goes away with per-vcpu events.

Again, per-vcpu events can't work well with task context events in the
guest. This is to defend against a malicious guest os kernel. Just like
what I mentioned above, the race might happen when:
1) the NMI handler accesses it;
2) the vmx_handle_exit code accesses overflow_events to sync data to the
guest os;
3) another vcpu thread of the same guest os calls close to delete the
perf_event.
>> struct kvm_arch {
>> 	struct kvm_mem_aliases *aliases;
>> @@ -415,6 +431,15 @@ struct kvm_arch {
>> 	/* fields used by HYPER-V emulation */
>> 	u64 hv_guest_os_id;
>> 	u64 hv_hypercall;
>> +
>> +	/*
>> +	 * fields used by PARAVIRT perf interface:
>> +	 * Used to organize all host perf_events representing guest
>> +	 * perf_event on a specific kvm instance
>> +	 */
>> +	atomic_t kvm_pv_event_num;
>> +	spinlock_t shadow_lock;
>> +	struct list_head *shadow_hash_table;
>
> Needs to be per-vcpu. Also wrap it in a kvm_vcpu_perf structure; the
> names are very generic.

If we moved it to per-vcpu, the host kernel would need to move the entry
(struct host_perf_shadow) among vcpus when an event is migrated to
another vcpu. That makes the code a little complicated and might
introduce something like a deadlock or a new race. With my
implementation, there is only one potential performance issue, as there
is some lock contention on shadow_lock. But the performance issue is not
severe, because:
1) a guest os doesn't support too many vcpus (usually no more than 8);
>> This limitation is different from the hardware PMU counter limitation.
>> When any application or guest os vcpu thread creates a perf_event, the
>> host kernel has no limitation. The kernel just arranges all perf_events
>> in a list (not considering the group case) and schedules them onto the
>> PMU hardware by a round-robin method.
>
> In practice, it will take such a long time to cycle through all events
> that measurement quality will deteriorate.

There are 2 things here:
1) we provide a good capability for applications to submit more events;
2) applications just use a small group of events, typically one cpu
context event per vcpu.
They are very different. In the usual case, 2), the measurement quality
wouldn't deteriorate.
> I prefer exposing a much smaller number of events so that multiplexing
> on the host side will be rare (for example, if both guest and host are
> monitoring at the same time, or to accommodate hardware constraints).
> If the guest needs 1024 events, it can schedule them itself (and then
> it knows the measurement is very inaccurate due to sampling).

A Linux guest os does schedule them. By default, the guest kernel enables
X86_PMC_IDX_MAX (64) events on a specific vcpu at the same time. See
functions kvm_pmu_enable and kvm_add_event.

Exposing is different from disabling/enabling. If we exposed/hid events
when enabling/disabling them, it would consume too many cpu resources, as
the host kernel would need to create/delete the events frequently. 1024
is just the upper limit on how many perf_events the host can _create_ for
the guest os instance. If the guest os is Linux, the number of active
events in the host for this guest os is at most VCPU_NUM*64.
>> How to process the failure? Kill the guest os? :)
>
> Next time we may fail too. And next time as well.

As the host kernel saves/accumulates data in
perf_event->host_perf_shadow.counter, it doesn't matter to have one
failure. Next time, when the event overflows again, it will copy all the
data back to the guest os.
> Well, without per-vcpu events, you can't guarantee this.

It doesn't matter. There is only one potential race, between the host
kernel and the guest kernel. When the guest vmexits to the host, it won't
access the data pointed to by shadow->guest_event_addr. The above
kvm_write_guest happens on the same vcpu. So we just need to make sure
the guest os vcpu accesses guest_perf_shadow->counter.overflows
atomically.

> With per-vcpu events, I agree.

Frankly, I used per-vcpu in the beginning, but moved to per-kvm after
checking every possible race issue.
> 1) IIUC exclude_user and exclude_kernel should just work. They work by
> counting only when the cpl matches, yes? The hardware cpl is available
> and valid in the guest.

Good pointer! Let me do some experiments to make sure it does work.
> 2) We should atomically enable/disable the hardware performance counter
> during guest entry/exit, like during an ordinary context switch, so
> that the guest doesn't measure host code (for example, ip would be
> meaningless).

I once checked it. At least under the vcpu thread context after vmexit,
the host code execution is on behalf of the guest, so it's reasonable to
count this part. If the vcpu thread is scheduled out, the host kernel
disables all events bound to this vcpu thread. ip is meaningless if the
NMI happens in the host code path, but the host kernel would accumulate
the overflow count into host_perf_shadow->counter.overflows. Next time an
NMI happens in the guest os, the host kernel injects an NMI into the
guest kernel, so the guest uses that pt_regs->ip.