Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces

From: Sean Christopherson
Date: Fri Apr 12 2024 - 15:17:12 EST


On Thu, Apr 11, 2024, Kan Liang wrote:
> >> +/*
> >> + * When a guest enters, force all active events of a PMU that supports
> >> + * the VPMU_PASSTHROUGH feature to be scheduled out. The events of other
> >> + * PMUs, such as uncore PMUs, should not be impacted. The guest can
> >> + * temporarily own all counters of the PMU.
> >> + * During this period, creation of any new !exclude_guest event on the
> >> + * PMU is rejected.
> >> + */
> >> +void perf_guest_enter(void)
> >> +{
> >> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> >> +
> >> +	lockdep_assert_irqs_disabled();
> >> +
> >> +	if (__this_cpu_read(__perf_force_exclude_guest))
> >
> > This should be a WARN_ON_ONCE, no?
>
> To debug the improper behavior of KVM?

Not so much "debug" as ensure that the platform owner notices that KVM is buggy.
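
E.g. assuming the existing check simply bails out, something like this
(untested, reusing the flag name from this patch):

	if (WARN_ON_ONCE(__this_cpu_read(__perf_force_exclude_guest)))
		return;

so that KVM calling perf_guest_enter() twice without a matching exit generates
a splat instead of being silently ignored.
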

> >> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
> >> +						  int cpu, struct task_struct *task)
> >> +{
> >> +	bool *force_exclude_guest = NULL;
> >> +
> >> +	if (!has_vpmu_passthrough_cap(event->pmu))
> >> +		return 0;
> >> +
> >> +	if (event->attr.exclude_guest)
> >> +		return 0;
> >> +
> >> +	if (cpu != -1) {
> >> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
> >> +	} else if (task && (task->flags & PF_VCPU)) {
> >> +		/*
> >> +		 * Only the running CPU needs to be checked at event creation.
> >> +		 * If the task is later moved to another CPU on which
> >> +		 * force_exclude_guest is set, the event will be filtered out
> >> +		 * and moved to the error state. See merge_sched_in().
> >> +		 */
> >> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
> >> +	}
> >
> > These checks are extremely racy, I don't see how this can possibly do the
> > right thing. PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
> > to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
> > include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
> > time slices).
> >
>
> This is to reject creation of an !exclude_guest event from the host perf
> tool while a "passthrough" guest is running.
> Could you please suggest a way to detect it via the struct task_struct?
>
> > Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
> > perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
> > per-CPU context mutex.
>
> Do you mean that the perf_guest_enter() check could happen right after
> the perf_force_exclude_guest_check()?
> It's possible. For this case, the event can still be created. It will be
> treated as an existing event and handled in merge_sched_in(). It will
> never be scheduled when a guest is running.
>
> The perf_force_exclude_guest_check() is to make sure that most cases are
> rejected at creation time. The corner cases will be rejected at the
> scheduling stage.

Ah, the "rejected in the schedule stage" is what I'm missing. But that creates
a gross ABI, because IIUC, event creation will "randomly" succeed based on whether
or not a CPU happens to be running in a KVM guest. I.e. it's not just the kernel
code that has races, the entire event creation is one big race.

What if perf had a global knob to enable/disable mediated PMU support? Then when
KVM is loaded with enable_mediated_pmu=true, call into perf to (a) check that there
are no existing !exclude_guest events (this part could be optional), and (b) set
the global knob to reject all new !exclude_guest events (for the core PMU?).

Hmm, or probably better, do it at VM creation. That has the advantage of playing
nice with CONFIG_KVM=y (perf could reject the enabling without completely breaking
KVM), and not causing problems if KVM is auto-probed but the user doesn't actually
want to run VMs.

E.g. (very roughly)

int x86_perf_get_mediated_pmu(void)
{
	if (refcount_inc_not_zero(...))
		return 0;

	if (<system wide events>)
		return -EBUSY;

	<slow path with locking>
}

void x86_perf_put_mediated_pmu(void)
{
	if (!refcount_dec_and_test(...))
		return;

	<slow path with locking>
}
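
Fleshing that slow path out a bit, purely for illustration (the mutex, the
refcount and the nr_include_guest_events counter below are all made up; perf
would need to maintain the latter as !exclude_guest events are created and
destroyed):

/* Hypothetical sketch, not existing perf code. */
static DEFINE_MUTEX(mediated_pmu_lock);
static refcount_t nr_mediated_pmu_vms = REFCOUNT_INIT(0);
static atomic_t nr_include_guest_events = ATOMIC_INIT(0);

int x86_perf_get_mediated_pmu(void)
{
	int ret = 0;

	/* Fast path: some other VM already flipped the knob. */
	if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
		return 0;

	mutex_lock(&mediated_pmu_lock);
	if (refcount_inc_not_zero(&nr_mediated_pmu_vms)) {
		/* Lost the race to another VM, nothing left to do. */
	} else if (atomic_read(&nr_include_guest_events)) {
		/* There are existing system-wide !exclude_guest events. */
		ret = -EBUSY;
	} else {
		refcount_set(&nr_mediated_pmu_vms, 1);
	}
	mutex_unlock(&mediated_pmu_lock);

	return ret;
}

void x86_perf_put_mediated_pmu(void)
{
	/*
	 * In this simplified sketch the "knob" is just the refcount being
	 * non-zero, so there is nothing extra to do on the final put;
	 * !exclude_guest event creation is allowed again once it hits zero.
	 */
	if (!refcount_dec_and_test(&nr_mediated_pmu_vms))
		return;
}

Event creation for a system-wide !exclude_guest event would then bump
nr_include_guest_events (and fail if nr_mediated_pmu_vms is non-zero) under
the same mutex, instead of poking at per-CPU state.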

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1bbf312cbd73..f2994377ef44 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12467,6 +12467,12 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	if (type)
 		return -EINVAL;
 
+	if (enable_mediated_pmu) {
+		ret = x86_perf_get_mediated_pmu();
+		if (ret)
+			return ret;
+	}
+
 	ret = kvm_page_track_init(kvm);
 	if (ret)
 		goto out;
@@ -12518,6 +12524,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm_mmu_uninit_vm(kvm);
 	kvm_page_track_cleanup(kvm);
 out:
+	x86_perf_put_mediated_pmu();
 	return ret;
 }

@@ -12659,6 +12666,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_page_track_cleanup(kvm);
 	kvm_xen_destroy_vm(kvm);
 	kvm_hv_destroy_vm(kvm);
+	x86_perf_put_mediated_pmu();
 }

 static void memslot_rmap_free(struct kvm_memory_slot *slot)