Re: [PATCH v4 4/6] KVM: x86/pmu: Re-evaluate Host-Only/Guest-Only on nested SVM transitions
From: Yosry Ahmed
Date: Fri Apr 24 2026 - 03:01:53 EST
> > > What do you think about having two flavors of kvm_pmu_handle_nested_transition()?
> > > One that defers via a request, and a "special" (SVM-only?) version that does
> > > direct updates.
> > >
> > > Poking into PMU state in arbitrary contexts makes me nervous. E.g. when calling
> > > svm_leave_nested(), odds are good EFER isn't even correct, and the update *needs*
> > > to be deferred.
> >
> > Hmm is it really that bad?
>
> It's not horrible, but it's a lot of "I think" and "should" and whatnot. I
> generally agree that it's unlikely to be a problem, but I can point at far too
> many bugs where KVM unexpectedly invokes a helper and consumes stale state.
>
> I'm not completely opposed to non-deferred updates, but I really don't want to
> use them for svm_leave_nested().
That makes sense, I had similar thoughts at some point.
>
> > - In the emulated VMRUN and #VMEXIT paths, EFER.SVME should be set in
> > both L1 and L2, so it should be fine.
> >
> > - In the restore path entering guest mode, EFER.SVME should also be set
> > in both L1 and L2.
> >
> > So I think svm_leave_nested() is the only interesting case:
> >
> > - In the restore path, svm_leave_nested() should only be called if the
> > CPU is in guest mode and EFER.SVME is set in both L1 and L2.
> >
> > - In the EFER update path, L1 will get a shutdown if we forcefully leave
> > nested anyway, unless userspace is setting state. Either way, the
> > value of EFER should be correct and valid to use to update the PMU
> > here.
> >
> > - In the vCPU free path, it shouldn't really matter, but the value of
> > EFER should still be correct.
> >
> > So I *think* generally the value of EFER should be fine to use. The
> > other inputs are is_guest_mode() and eventsel. In both cases we should
> > just make sure to update the PMU *after* updating the state.
> >
> > So I think we'd end up with something similar to Jim's v2:
> > https://lore.kernel.org/kvm/20260129232835.3710773-1-jmattson@xxxxxxxxxx/
> >
> > We will directly re-evaluate eventsel_hw on nested transitions, EFER
> > updates, and PMU MSR updates -- without deferring anything.
> >
> > We'd still need to make other changes:
> > - Always disable the PMC if EFER.SVME is clear and either H/G bit (or
> > both) is set.
> >
> > - Handle VMRUN correctly. I was going to suggest just moving the call to
> > kvm_skip_emulated_instruction() to the end of the function, but that
> > doesn't handle the case where we inject #VMEXIT(INVALID) due to a
> > VMRUN failure (e.g. consistency checks, loading CR3, etc).
> >
> > I am actually not sure if the instruction should count in host or
> > guest mode in this case. Logically, we never entered the guest, so
> > perhaps counting it in host mode is the right thing to do? I think
> > we'll also need to test what HW does.
> >
> > Honestly, it would be a lot easier if someone from AMD could just tell
> > us these things :)
> >
> > Basically:
> > - Does the PMU generally count based on processor state (e.g. guest
> > mode, EFER.SVME) before or after instruction retirement?
> > - A successful VMRUN will be counted in guest mode, what about a
> > failed VMRUN that produces #VMEXIT(INVALID)?
> >
> > > I definitely don't love having two separate update mechanisms, but it seems like
> > > the safest option in this case.
> >
> > Same here, and I like the deferred handling, but to Jim's point I think
> > we can't use it anywhere :/
>
> Why can't we defer the svm_leave_nested() case? The only flows that invoke
> svm_leave_nested() are non-architectural, being precise there doesn't matter at
> all (and I'm not convinced it matters in general given none of us can figure
> out what hardware is _supposed_ to do).
>
> Having a synchronous path for architectural flows, and a deferred mechanism for
> everything else seems reasonable, and would all but eliminate my concerns about
> consuming stale state and/or doing things like attempting to write MSRs while
> freeing a vCPU.
Sounds good to me. See my other reply about specifics.
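
To make the split concrete, here is a rough userspace sketch of what is being discussed: one helper that re-evaluates the effective event selector from EFER.SVME, guest mode, and the Host-Only/Guest-Only bits, plus a synchronous and a deferred way of invoking it. This is not the patch and not KVM code. Every toy_* name and the struct are hypothetical, the bit positions are written out by hand to mirror the kernel's AMD64_EVENTSEL_GUESTONLY/HOSTONLY, ARCH_PERFMON_EVENTSEL_ENABLE and EFER_SVME definitions, and counting in both modes when both bits are set with SVME=1 is an assumption that would still need to be confirmed against hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions written out by hand for this sketch. */
#define EVENTSEL_ENABLE    (1ULL << 22)
#define EVENTSEL_GUESTONLY (1ULL << 40)
#define EVENTSEL_HOSTONLY  (1ULL << 41)
#define EFER_SVME          (1ULL << 12)

/* Hypothetical stand-in for the vCPU/PMC state involved. */
struct toy_vcpu {
	uint64_t efer;
	bool guest_mode;          /* what is_guest_mode() would report */
	bool pmu_update_pending;  /* stand-in for a deferred request */
	uint64_t eventsel;        /* value written by the guest */
	uint64_t eventsel_hw;     /* value that would be programmed */
};

/* Compute the effective selector from the three inputs. */
static uint64_t toy_eventsel_hw(uint64_t eventsel, uint64_t efer,
				bool guest_mode)
{
	bool host_only = eventsel & EVENTSEL_HOSTONLY;
	bool guest_only = eventsel & EVENTSEL_GUESTONLY;
	bool count = true;

	if (host_only || guest_only) {
		if (!(efer & EFER_SVME))
			count = false;  /* either bit set with SVME clear */
		else if (host_only && guest_only)
			count = true;   /* ASSUMPTION: counts in both modes */
		else
			count = guest_mode ? guest_only : host_only;
	}

	return count ? eventsel : (eventsel & ~EVENTSEL_ENABLE);
}

/* Synchronous flavor: for architectural flows (emulated VMRUN, #VMEXIT). */
static void toy_pmu_update_sync(struct toy_vcpu *v)
{
	v->eventsel_hw = toy_eventsel_hw(v->eventsel, v->efer, v->guest_mode);
}

/* Deferred flavor: only record that a re-evaluation is needed. */
static void toy_pmu_update_deferred(struct toy_vcpu *v)
{
	v->pmu_update_pending = true;
}

/* Run before (re)entering the guest, once state has settled. */
static void toy_pre_run(struct toy_vcpu *v)
{
	if (v->pmu_update_pending) {
		v->pmu_update_pending = false;
		toy_pmu_update_sync(v);
	}
}
```

The point of the deferred flavor is that it reads no vCPU state at the call site, so a caller like svm_leave_nested() can't consume a stale EFER or guest-mode value; the inputs are only read in toy_pre_run().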