Re: [PATCH] KVM: x86/pmu: Add '.exclude_hv = 1' for guest perf_event
From: peterz
Date: Wed Aug 12 2020 - 07:11:55 EST
On Wed, Aug 12, 2020 at 12:25:43PM +0200, Paolo Bonzini wrote:
> On 12/08/20 07:07, Like Xu wrote:
> > To emulate PMC counter for guest, KVM would create an
> > event on the host with 'exclude_guest=0, exclude_hv=0'
> > which simply makes no sense and is utterly broken.
> >
> > To keep perf semantics consistent, any event created by
> > pmc_reprogram_counter() should both set exclude_hv and
> > exclude_host in the KVM context.
> >
> > Message-ID: <20200811084548.GW3982@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
> > Signed-off-by: Like Xu <like.xu@xxxxxxxxxxxxxxx>
> > ---
> > arch/x86/kvm/pmu.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 67741d2a0308..6a30763a10d7 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -108,6 +108,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
> > .exclude_host = 1,
> > .exclude_user = exclude_user,
> > .exclude_kernel = exclude_kernel,
> > + .exclude_hv = 1,
> > .config = config,
> > };
> >
> >
>
> x86 does not have a hypervisor privilege level, so it never uses
Arguably it does when Xen, but I don't think we support that, so *phew*.
> exclude_hv; exclude_host already excludes all root mode activity for
> both ring0 and ring3.
Right, but we want to tighten the permission checks and not excluding_hv
is just sloppy.
That said; the exclude_host / exclude_guest thing is a giant trainwreck
and I'm really not sure what to do about it.
The thing is, we very much do not want to allow unpriv user to be able
to create: exclude_host=1, exclude_guest=0 counters (they currently
can).
So we really want to add:
if ((!exclude_host || !exclude_guest || !exclude_hv) && !perf_allow_kernel())
return -EACCESS;
But the problem is, they were added late, so lots of userspace will not
be setting those fields (or might not have even known about them), so we
got to somehow deal with:
exclude_host == exclude_guest == 0
And because of that, we now also have genius code like
(intel_set_masks, amd_core_hw_config has even 'funnier' code):
if (event->attr.exclude_host)
__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
if (event->attr.exclude_guest)
__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
Which is just confusing and bad, that really should've been:
if (!event->attr.exclude_host)
__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
if (!event->attr.exclude_guest)
__set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
:-(
Also, exclude_host is really poorly defined:
https://lkml.kernel.org/r/20200806091827.GY2674@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
"Suppose we have nested virt:
L0-hv
|
G0/L1-hv
|
G1
And we're running in G0, then:
- 'exclude_hv' would exclude L0 events
- 'exclude_host' would ... exclude L1-hv events?
- 'exclude_guest' would ... exclude G1 events?
Then the next question is, if G0 is a host, does the L1-hv run in
G0 userspace or G0 kernel space?
I was assuming G0 userspace would not include anything L1 (kvm is a
kernel module after all), but what do I know."
The way it is implemented, you basically have to always set
exclude_host=0, even if there is no virt at all and you want to measure
your own userspace thing -- which is just weird.
Meanwhile ARM64 couldn't quite figure out what it was all supposed to be
either and also implemented something -- and i've not tried to
understand what exactly, but hopefully compatible enough that we're not
in an even worse corner.
So on the one hand we're now leaking all sorts due to lack of permission
checks, on the other hand we can't fix it because we're not allowed to
break userspace :-(
I suppose the 'best' option at this point is something like:
/*
* comment that explains the trainwreck.
*/
if (!exclude_host && !exclude_guest)
exclude_guest = 1;
if ((!exclude_hv || !exclude_guest) && !perf_allow_kernel())
return -EPERM;
But that takes away the possibility of actually having:
'exclude_host=0, exclude_guest=0' to create an event that measures both,
which also sucks.