Re: [PATCH v4 4/6] KVM: x86/pmu: Re-evaluate Host-Only/Guest-Only on nested SVM transitions

From: Sean Christopherson

Date: Wed Apr 22 2026 - 18:43:02 EST


On Tue, Apr 21, 2026, Yosry Ahmed wrote:
> On Thu, Apr 09, 2026 at 02:21:14PM -0700, Sean Christopherson wrote:
> > On Thu, Apr 09, 2026, Sean Christopherson wrote:
> > > On Thu, Apr 09, 2026, Jim Mattson wrote:
> > > > On Thu, Apr 9, 2026 at 10:48 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > On Thu, Apr 09, 2026, Jim Mattson wrote:
> > > > > > > > In general, this deferral is misguided. The G/H bits should be
> > > > > > > > re-evaluated before we call kvm_pmu_instruction_retired() for an
> > > > > > > > emulated instruction.

...

> > > > > > > > This happens too late for VMRUN, since we have already called
> > > > > > > > kvm_pmu_instruction_retired() via kvm_skip_emulated_instruction(), and
> > > > > > > > VMRUN counts as a *guest* instruction.
> > > > > > >
> > > > > > > It's just VMRUN that's problematic though, correct? I.e. the scheme as a whole
> > > > > > > is fine, we just need to special case VMRUN due to SVM's erratum^Warchitecture.
> > > > > > > Alternatively, maybe we could get AMD to document the silly VMRUN behavior as an
> > > > > > > erratum, then we could claim KVM is architecturally superior. :-D
> > > > > >
> > > > > > Here, it's just VMRUN. Above, it's WRMSR(EFER).
> > > > >
> > > > > But clearing EFER.SVME while in the guest generates architecturally undefined
> > > > > behavior. I don't see any reason to complicate PMU virtualization for that
> > > > > scenario, especially now that KVM synthesizes triple fault for L1.
> > > >
> > > > L1 can clear the virtual EFER.SVME. That is well-defined.
> > >
> > > Gah, I forgot that the H/G bits are ignored when EFER.SVME=0. That's really
> > > annoying.
> >
> > What do you think about having two flavors of kvm_pmu_handle_nested_transition()?
> > One that defers via a request, and a "special" (SVM-only?) version that does
> > direct updates.
> >
> > Poking into PMU state in arbitrary contexts makes me nervous. E.g. when calling
> > svm_leave_nested(), odds are good EFER isn't even correct, and the update *needs*
> > to be deferred.
>
> Hmm is it really that bad?

It's not horrible, but it's a lot of "I think" and "should" and whatnot. I
generally agree that it's unlikely to be a problem, but I can point at far too
many bugs where KVM unexpectedly invokes a helper and consumes stale state.

I'm not completely opposed to non-deferred updates, but I really don't want to
use them for svm_leave_nested().

> - In the emulated VMRUN and #VMEXIT paths, EFER.SVME should be set in
> both L1 and L2, so it should be fine.
>
> - In the restore path entering guest mode, EFER.SVME should also be set
> in both L1 and L2.
>
> So I think svm_leave_nested() is the only interesting case:
>
> - In the restore path, svm_leave_nested() should only be called if the
> CPU is in guest mode and EFER.SVME is set in both L1 and L2.
>
> - In the EFER update path, L1 will get a shutdown if we forcefully leave
> nested anyway, unless userspace is setting state. Either way, the
> value of EFER should be correct and valid to use to update the PMU
> here.
>
> - In the vCPU free path, it shouldn't really matter, but the value of
> EFER should still be correct.

> So I *think* generally the value of EFER should be fine to use. The
> other inputs are is_guest_mode() and eventsel. In both cases we should
> just make sure to update the PMU *after* updating the state.
>
> So I think we'd end up with something similar to Jim's v2:
> https://lore.kernel.org/kvm/20260129232835.3710773-1-jmattson@xxxxxxxxxx/
>
> We will directly re-evaluate eventsel_hw on nested transitions, EFER
> updates, and PMU MSR updates -- without deferring anything.
>
> We'd still need to make other changes:
> - Always disable the PMC if EFER.SVME is clear and either H/G bit (or
> both) is set.
>
> - Handle VMRUN correctly. I was going to suggest just moving the call to
> kvm_skip_emulated_instruction() to the end of the function, but that
> doesn't handle the case where we inject #VMEXIT(INVALID) due to a
> VMRUN failure (e.g. consistency checks, loading CR3, etc).
>
> I am actually not sure if the instruction should count in host or
> guest mode in this case. Logically, we never entered the guest, so
> perhaps counting it in host mode is the right thing to do? I think
> we'll also need to test what HW does.
>
> Honestly, it would be a lot easier if someone from AMD could just tell
> us these things :)
>
> Basically:
> - Does the PMU generally count based on processor state (e.g. guest
> mode, EFER.SVME) before or after instruction retirement?
> - A successful VMRUN will be counted in guest mode, what about a
> failed VMRUN that produces #VMEXIT(INVALID)?
>
> > I definitely don't love having two separate update mechanisms, but it seems like
> > the safest option in this case.
>
> Same here, and I like the deferred handling, but to Jim's point I think
> we can't use it everywhere :/
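
To make sure we're all talking about the same rule, here is a standalone
userspace sketch of the H/G evaluation as proposed above. It is not KVM code:
pmc_should_count() and the macro names are made up for illustration. The
"SVME clear plus either H/G bit set means the counter is disabled" case is the
proposal from this thread, and counting in both modes when both bits are set
is an assumption; neither has been checked against hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* AMD PerfEvtSel Guest-Only/Host-Only bits and EFER.SVME (illustrative names). */
#define EVENTSEL_GUESTONLY	(1ULL << 40)
#define EVENTSEL_HOSTONLY	(1ULL << 41)
#define EFER_SVME		(1ULL << 12)

/*
 * Hypothetical helper: should a counter with this event selector count,
 * given the vCPU's current EFER and whether it is in guest mode (L2)?
 */
static bool pmc_should_count(uint64_t eventsel, uint64_t efer, bool guest_mode)
{
	uint64_t hg = eventsel & (EVENTSEL_GUESTONLY | EVENTSEL_HOSTONLY);

	/* Neither bit set: count regardless of mode. */
	if (!hg)
		return true;

	/* Proposal in this thread: SVME clear + any H/G bit => disabled. */
	if (!(efer & EFER_SVME))
		return false;

	/* Assumption: both bits set counts in both host and guest mode. */
	if (hg == (EVENTSEL_GUESTONLY | EVENTSEL_HOSTONLY))
		return true;

	/* Exactly one bit set: it must match the current mode. */
	return guest_mode ? !!(hg & EVENTSEL_GUESTONLY)
			  : !!(hg & EVENTSEL_HOSTONLY);
}
```

The inputs are exactly the three listed above (EFER, is_guest_mode(), and
eventsel), which is why the ordering of the re-evaluation relative to the
state update matters.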

Why can't we defer the svm_leave_nested() case? The only flows that invoke
svm_leave_nested() are non-architectural, so being precise there doesn't matter
at all (and I'm not convinced it matters in general given none of us can figure
out what hardware is _supposed_ to do).

Having a synchronous path for architectural flows, and a deferred mechanism for
everything else seems reasonable, and would all but eliminate my concerns about
consuming stale state and/or doing things like attempting to write MSRs while
freeing a vCPU.
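
Completely untested, and a toy userspace model rather than a patch, but the
split I have in mind looks something like the below. Every name here
(struct vcpu_model, pmu_transition_sync(), pmu_transition_deferred(),
vcpu_pre_run(), etc.) is hypothetical, and the pending flag is only a stand-in
for a real KVM request.

```c
#include <stdbool.h>

/* Toy model of the vCPU state the PMU consumes; not a real KVM structure. */
struct vcpu_model {
	bool guest_mode;		/* current is_guest_mode()-like state */
	bool svme;			/* current EFER.SVME */
	bool pmu_update_pending;	/* stand-in for a deferred request */
	bool pmu_guest_mode;		/* state the PMU last evaluated against */
	bool pmu_svme;
	unsigned int pmu_evals;		/* number of re-evaluations performed */
};

/* Re-evaluate PMU state from the vCPU's *current* state. */
static void pmu_reevaluate(struct vcpu_model *v)
{
	v->pmu_guest_mode = v->guest_mode;
	v->pmu_svme = v->svme;
	v->pmu_update_pending = false;
	v->pmu_evals++;
}

/* Synchronous flavor: only for architectural flows with known-good state. */
static void pmu_transition_sync(struct vcpu_model *v)
{
	pmu_reevaluate(v);
}

/* Deferred flavor: record that an update is needed, touch nothing else. */
static void pmu_transition_deferred(struct vcpu_model *v)
{
	v->pmu_update_pending = true;
}

/* Runs before re-entering the guest, once all state is consistent. */
static void vcpu_pre_run(struct vcpu_model *v)
{
	if (v->pmu_update_pending)
		pmu_reevaluate(v);
}

/* Architectural flow: emulated VMRUN updates state, then the PMU, in order. */
static void emulate_vmrun(struct vcpu_model *v)
{
	v->guest_mode = true;
	pmu_transition_sync(v);
}

/* Non-architectural flow: forced leave, where other state may be stale. */
static void force_leave_nested(struct vcpu_model *v)
{
	v->guest_mode = false;
	pmu_transition_deferred(v);
}
```

The point of the model is that the forced-leave path never reads EFER or pokes
the PMU in whatever context it happens to run in; it only sets the flag, and
the actual re-evaluation happens later from a well-defined point.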