Re: [RFC] para virt interface of perf to support kvm guest os statistics collection in guest os
From: Zhang, Yanmin
Date: Wed Jun 09 2010 - 05:19:22 EST
On Wed, 2010-06-09 at 11:33 +0300, Avi Kivity wrote:
> On 06/09/2010 06:30 AM, Zhang, Yanmin wrote:
> > From: Zhang, Yanmin <yanmin_zhang@xxxxxxxxxxxxxxx>
> >
> > Based on Ingo's idea, I implemented a para virt interface for perf to support
> > statistics collection in the guest os. That means we can run the perf tool in
> > the guest os directly.
> >
> > Great thanks to Peter Zijlstra. He is really the architect of this work and gave
> > me the design suggestions. I also want to thank Yangsheng and Lin Ming for their
> > generous help.
> >
> > The design is:
> >
> > 1) Add a kvm_pmu whose callbacks mostly just call a hypercall to vmexit to the host kernel;
> > 2) Create a host perf_event per guest perf_event;
> > 3) The host kernel syncs perf_event count/overflow data changes to the guest perf_event
> > when processing perf_event overflows after an NMI arrives. The host kernel injects an NMI
> > into the guest kernel if a guest event overflows;
> > 4) The guest kernel goes through all enabled events on the current cpu and outputs data
> > when they overflow;
> > 5) No change in user space.
> >
>
Thanks for your kind comments.
> One thing that's missing is documentation of the guest/host ABI. It
> will be a requirement for inclusion, but it will also be a great help
> for review, so please provide it ASAP.
I will add such a document. It will include:
1) The struct perf_event definition. The guest and host os have to share the same
data structure, as the host kernel syncs data changes (count/overflows and others if
needed) back to the guest os.
2) A list of all hypercalls.
3) The guest needs an NMI handler that checks all guest events.
>
> Please also split the guest and host parts into separate patches.
I will do so.
>
> >
> > -#define KVM_MAX_MMU_OP_BATCH 32
> >
>
> Keep that please.
I will do so.
>
> > +/* Operations for KVM_PERF_OP */
> > +#define KVM_PERF_OP_OPEN 1
> > +#define KVM_PERF_OP_CLOSE 2
> > +#define KVM_PERF_OP_ENABLE 3
> > +#define KVM_PERF_OP_DISABLE 4
> > +#define KVM_PERF_OP_START 5
> > +#define KVM_PERF_OP_STOP 6
> > +#define KVM_PERF_OP_READ 7
> >
>
> Where is the hypercall number for the perf hypercall?
I define it in the file kvm_para.h, like KVM_HC_MMU_OP.
>
> > +static bool kvm_reserve_pmc_hardware(void)
> > +{
> > + if (nmi_watchdog == NMI_LOCAL_APIC)
> > + disable_lapic_nmi_watchdog();
> > +
> > + return true;
> > +}
> > +
> > +static void kvm_release_pmc_hardware(void)
> > +{
> > + if (nmi_watchdog == NMI_LOCAL_APIC)
> > + enable_lapic_nmi_watchdog();
> > +}
> >
>
> Disabling the watchdog is unfortunate. Why is it necessary?
perf always uses NMI, so we disable the NMI watchdog when a perf_event is
set up, in case the two interfere with each other.
>
> > +
> > +static int kvm_pmu_enable(struct perf_event *event)
> > +{
> > + int ret;
> > + unsigned long addr = __pa(event);
> > +
> > + if (kvm_add_event(event))
> > + return -1;
> > +
> > + ret = kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_ENABLE,
> > + addr, (unsigned long) event->shadow);
> >
>
> This is suspicious. Passing virtual addresses to the host is always
> problematic. Or is event->shadow only used as an opaque cookie?
I use perf_event->shadow on both the host and guest sides.
1) On the host side, perf_event->shadow points to an area holding the page-mapping
information of the guest perf_event. The host kernel uses it to sync data
changes back to the guest;
2) On the guest side, perf_event->shadow saves the virtual address of the host
perf_event on the host side.
kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, ...) returns that virtual address.
The guest os shouldn't interpret it, only pass it back to the host kernel in the
following hypercalls. It might be a security issue for the host kernel. Originally,
I planned for the guest os not to use perf_event->shadow: the host kernel would
maintain a per-vcpu event list keyed by the address of the guest perf_event. But
considering vcpu thread migration across logical cpus, such a list needs locking and
the implementation becomes a little complicated.
I will double-check the list method, as the security issue is real.
>
> > +int __init kvm_init_hw_perf_events(void)
> > +{
> > + if (!kvm_para_available())
> > + return -1;
> > +
> > + x86_pmu.handle_irq = kvm_default_x86_handle_irq;
> > +
> > + pr_cont("KVM PARA PMU driver.\n");
> > + register_die_notifier(&kvm_perf_event_nmi_notifier);
> > +
> > + return 0;
> > +}
> >
>
> Need to detect the kvm pmu via its own cpuid bit.
Ok. I will add a feature bit, KVM_FEATURE_PARA_PERF, something like
KVM_FEATURE_CLOCKSOURCE.
>
> > +
> > +static int __kvm_hw_perf_event_init(struct perf_event *event)
> > +{
> > + struct hw_perf_event *hwc = &event->hw;
> > + int err;
> > + unsigned long result;
> > + unsigned long addr;
> > +
> > + err = 0;
> > + if (!atomic_inc_not_zero(&active_events)) {
> > + mutex_lock(&pmc_reserve_mutex);
> > + if (atomic_read(&active_events) == 0) {
> > + if (!kvm_reserve_pmc_hardware())
> > + err = -EBUSY;
> > + }
> > + if (!err)
> > + atomic_inc(&active_events);
> > + mutex_unlock(&pmc_reserve_mutex);
> > + if (err)
> > + return err;
> > + }
> > +
> > + event->destroy = kvm_hw_perf_event_destroy;
> > +
> > + hwc->idx = -1;
> > + hwc->last_cpu = -1;
> > + hwc->last_tag = ~0ULL;
> > +
> > + addr = __pa(event);
> > + result = kvm_hypercall3(KVM_PERF_OP, KVM_PERF_OP_OPEN, addr, 0);
> > +
> > + if (result)
> > + event->shadow = (void *) result;
> > + else
> > + err = -1;
> >
>
> Ok, so event->shadow is never dereferenced. In that case better not
> make it a pointer at all, keep it unsigned long.
The host kernel also uses it, as a real pointer on its side.
>
> Note that you can run 32-bit guests on 64-bit hosts, so the cookie
> better not exceed 32 bits.
I forgot about 32-bit guests. I need to double-check it. It seems I have to implement
a list on the host side and stop using the pointer on the guest side.
>
> > +
> > +static void kvm_sync_event_to_guest(struct perf_event *event, int overflows)
> > +{
> > + struct hw_perf_event *hwc = &event->hw;
> > + struct kvmperf_event *kvmevent;
> > + int offset, len, data_len, copied, page_offset;
> > + struct page *event_page;
> > + void *shared_kaddr;
> > +
> > + kvmevent = event->shadow;
> > + offset = kvmevent->event_offset;
> > +
> > + /* Copy perf_event->count firstly */
> > + offset += offsetof(struct perf_event, count);
> > + if (offset < PAGE_SIZE) {
> > + event_page = kvmevent->event_page;
> > + page_offset = offset;
> > + } else {
> > + event_page = kvmevent->event_page2;
> > + page_offset = offset - PAGE_SIZE;
> > + }
> > + shared_kaddr = kmap_atomic(event_page, KM_USER0);
> > + *((atomic64_t *)(shared_kaddr + page_offset)) = event->count;
> >
>
> This copy will not be atomic on 32-bit hosts.
Right. But it shouldn't be a problem, as the vcpu thread stops when it vmexits to the
host to process events. In addition, only the current cpu in the guest accesses the
perf_events linked to the current cpu.
>
> > +
> > + offset = kvmevent->event_offset;
> > + offset += offsetof(struct perf_event, hw);
> > + if (offset < PAGE_SIZE) {
> > + if (event_page == kvmevent->event_page2) {
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > + event_page = kvmevent->event_page;
> > + shared_kaddr = kmap_atomic(event_page, KM_USER0);
> > + }
> > + page_offset = offset;
> > + } else {
> > + if (event_page == kvmevent->event_page) {
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > + event_page = kvmevent->event_page2;
> > + shared_kaddr = kmap_atomic(event_page, KM_USER0);
> > + }
> > + page_offset = offset - PAGE_SIZE;
> > + }
> > +
> > + if (overflows)
> > + atomic_add(overflows, (atomic_t *)(shared_kaddr + page_offset));
> > +
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> >
>
> Where is the actual event copied? I'm afraid it's hard to understand
> the code.
Sorry, I should have put more comments around the code.
When an event overflows, the host kernel syncs perf_event->count to the guest os
perf_event->count, and increments the guest perf_event->hw.overflows.
+struct kvmperf_event {
+ unsigned int event_offset;
+ struct page *event_page;
+ struct page *event_page2;
+};
Actually, on the host side, perf_event->shadow points to a kvmperf_event.
kvmperf_event->event_page points to the physical page of the guest perf_event;
event_page2 points to the 2nd page if the perf_event data structure straddles 2 pages.
event_offset marks the offset of the structure within the 1st page.
>
> > +#if 0
> > + offset += offsetof(struct hw_perf_event, prev_count);
> > + data_len = sizeof(struct hw_perf_event) -
> > + offsetof(struct hw_perf_event, prev_count);
> > + if (event_page == kvmevent->event_page2) {
> > + page_offset += offsetof(struct hw_perf_event, prev_count);
> > + memcpy(shared_kaddr + page_offset,
> > + &hwc->prev_count, data_len);
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > +
> > + return;
> > + }
> > +
> > + copied = 0;
> > + if (offset < PAGE_SIZE) {
> > + len = PAGE_SIZE - offset;
> > + if (len > data_len)
> > + len = data_len;
> > + memcpy(shared_kaddr + offset,
> > + &hwc->prev_count, data_len);
> > + copied = len;
> > + page_offset = 0;
> > + } else
> > + page_offset = offset - PAGE_SIZE;
> > +
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > + len = data_len - copied;
> > + if (len) {
> > + /* Copy across pages */
> > + shared_kaddr = kmap_atomic(kvmevent->event_page2, KM_USER0);
> > + memcpy(shared_kaddr + page_offset,
> > + ((void *)&hwc->prev_count) + copied, len);
> > + kunmap_atomic(shared_kaddr, KM_USER0);
> > + }
> > +#endif
> >
>
> Maybe here, but the #if 0 doesn't help.
The code in the '#if 0' block tries to copy more data back to the guest os
perf_event. I commented it out as that data is not really used by the guest os.
>
> > +}
> > +
> >
> > +static struct perf_event *
> > +kvm_pv_perf_op_open(struct kvm_vcpu *vcpu, gpa_t addr)
> > +{
> > + int ret;
> > + struct perf_event *event;
> > + struct perf_event *host_event = NULL;
> > + struct kvmperf_event *shadow = NULL;
> > +
> > + event = kzalloc(sizeof(*event), GFP_KERNEL);
> > + if (!event)
> > + goto out;
> > +
> > + shadow = kzalloc(sizeof(*shadow), GFP_KERNEL);
> > + if (!shadow)
> > + goto out;
> > +
> > + shadow->event_page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
> > + shadow->event_offset = addr & ~PAGE_MASK;
> >
>
> Might truncate on 32-bit hosts. PAGE_MASK is 32-bit while addr is 64 bit.
The code above runs only on the host side. Is it possible for the host kernel to be
32-bit and the guest kernel 64-bit? I really need to split the patch into host and
guest parts, as a single patch misleads you guys.
>
> > + if (shadow->event_offset + sizeof(struct perf_event) > PAGE_SIZE) {
> > + shadow->event_page2 = gfn_to_page(vcpu->kvm,
> > + (addr >> PAGE_SHIFT) + 1);
> > + }
> >
>
> May be simpler to make event_pages an array instead of two independent
> variables.
Ok, I will do so.
>
> > +
> > + ret = kvm_read_guest(vcpu->kvm, addr, event, sizeof(*event));
> > + if (ret)
> > + goto out;
> >
>
> I assume this is to check existence?
Calling kvm_read_guest here gets a copy of the guest perf_event, since it carries
perf_event.attr. The host needs that attr to create the host perf_event.
> It doesn't work well with memory
> hotplug. In general it's preferred to use
> kvm_read_guest()/kvm_write_guest() instead of gfn_to_page() during
> initialization to prevent pinning and allow for hotplug.
That's an issue. But the host kernel can't sleep while processing event
overflows in NMI context.
>
> > +
> > + /*
> > + * By default, we disable the host event. Later on, the guest os
> > + * triggers a perf_event_attach to enable it
> > + */
> > + event->attr.disabled = 1;
> > + event->attr.inherit = 0;
> > + event->attr.enable_on_exec = 0;
> > + /*
> > + * We don't support exclude mode of user and kernel for guest os,
> > + * which means we always collect both user and kernel for the guest os
> > + */
> > + event->attr.exclude_user = 0;
> > + event->attr.exclude_kernel = 0;
> > + /* We always create a cpu context host perf event */
> > +
> > + host_event = perf_event_create_kernel_counter(&event->attr, -1,
> > + current->pid, kvm_perf_event_overflow);
> > +
> > + if (IS_ERR(host_event)) {
> > + host_event = NULL;
> > + goto out;
> > + }
> > + host_event->shadow = shadow;
> > +
> > +out:
> > + if (!host_event)
> > + kfree(shadow);
> > + kfree(event);
> > +
> > + return host_event;
> > +}
> > +
> >
> > +
> > +int kvm_pv_perf_op(struct kvm_vcpu *vcpu, int op_code, unsigned long a1,
> > + unsigned long a2, unsigned long *result)
> > +{
> > + unsigned long ret;
> > + struct perf_event *host_event;
> > + gpa_t addr;
> > +
> > + addr = (gpa_t)(a1);
> >
>
> A 32-bit guest has a 64-bit gpa_t but 32-bit ulongs, so gpas need to be
> passed in two hypercall arguments.
Ok. Originally I passed 2 parameters, like hc_gpa. I will redo it.
>
> > +
> > + switch(op_code) {
> > + case KVM_PERF_OP_OPEN:
> > + ret = (unsigned long) kvm_pv_perf_op_open(vcpu, addr);
> > + break;
> > + case KVM_PERF_OP_CLOSE:
> > + host_event = (struct perf_event *) a2;
> >
>
> First, you get truncation in a 32-bit guest. Second, you must validate
> the argument! The guest kernel can easily subvert the host by passing a
> bogus host_event.
You are right. I will implement the list method on the host side.
>
> It may be better to have the guest create an id to the host, and the
> host can simply look up the id in a table:
Perhaps the address of the guest perf_event is the best id.
>
> if (a2 >= KVM_PMU_MAX_EVENTS)
> bail out;
> host_event = vcpu->pmu.host_events[a2];
>
> > @@ -4052,6 +4054,16 @@ static unsigned long kvm_get_guest_ip(vo
> > return ip;
> > }
> >
> > +int kvm_notify_event_overflow(void)
> > +{
> > + if (percpu_read(current_vcpu)) {
> > + kvm_inject_nmi(percpu_read(current_vcpu));
> > + return 0;
> > + }
> > +
> > + return -1;
> > +}
> >
>
> This should really go through the APIC PERF LVT. No interface
> currently, but we are working on it.
>
> This way the guest can configure the perf interrupt to be NMI, INTR, or
> anything it likes.
>
Thanks for the pointer.
Yanmin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/