Re: [PATCH 1/2] KVM: nVMX: fix CR4_READ_SHADOW when L0 updates CR4 during a signal

From: Thomas Prescher
Date: Thu Apr 18 2024 - 09:46:55 EST


On Wed, 2024-04-17 at 09:11 -0700, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Thomas Prescher wrote:
> > On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote:
> > > Hur dur, I forgot that KVM provides a "guest_mode" stat. 
> > > Userspace can do
> > > KVM_GET_STATS_FD on the vCPU FD to get a file handle to the
> > > binary stats,
> > > and then you wouldn't need to call back into KVM just to query
> > > guest_mode.
> > >
> > > Ah, and I also forgot that we have kvm_run.flags, so adding
> > > KVM_RUN_X86_GUEST_MODE would also be trivial (I almost suggested
> > > it
> > > earlier, but didn't want to add a new field to kvm_run without a
> > > very good
> > > reason).
> >
> > Thanks for the pointers. This is really helpful.
> >
> > I tried the "guest_mode" stat as you suggested and it solves the
> > immediate issue we have with VirtualBox/KVM.
>
> Note,
>
> > What I don't understand is that we do not get the effective CR4
> > value
> > of the L2 guest in kvm_run.s.regs.sregs.cr4.
>
> Because what you're asking for is *not* the effective CR4 value of
> L2.
>
> E.g. if L1 is using legacy shadowing paging to run L2, L1 is likely
> going to run
> L2 with GUEST_CR0.PG=1, GUEST_CR4.PAE=1, and GUEST_CR4.PSE=0 (though
> PSE is largely
> irrelevant), i.e. will either use PAE paging or 64-bit paging to
> shadow L2.
>
> But L2 itself could be unpaged (CR0.PG=0, CR4.PAE=x, CR4.PSE=x),
> using 32-bit
> paging (CR0.PG=1, CR4.PAE=0, CR4.PSE=0), or using 32-bit paging with
> 4MiB hugepages
> (CR0.PG=1, CR4.PAE=0, CR4.PSE=1).  In all of those cases, the
> effective CR0 and CR4
> values consumed by hardware are CR0.PG=1, CR4.PAE=1, and CR4.PSE.
>
> Or to convolute things even further, if L0 is running L1 with
> shadowing paging,
> and L1 is running L2 with shadow paging but doing something weird and
> using PSE
> paging, then it would be possible to end up with:
>
>   vmcs12->guest_cr4:
>      .pae = 0
>      .pse = 1
>
>   vmcs12->cr4_read_shadow:
>      .pae = 0
>      .pse = 0
>
>   vmcs02->guest_cr4:
>      .pae = 1
>      .pse = 0
>
> > Instead, userland sees the contents of Vmcs::GUEST_CR4. Shouldn't
> > this be the
> > combination of GUEST_CR4, GUEST_CR4_MASK and CR4_READ_SHADOW, i.e.
> > what L2
> > actually sees as CR4 value?
>
> Because KVM_{G,S}ET_SREGS (and all other uAPIs in that vein) are
> defined to operate
> on actual vCPU state, and having them do something different if the
> vCPU is in guest
> mode would confusing/odd, and nonsensical to differences between VMX
> and SVM.
>
> SVM doesn't have per-bit CR0/CR4 controls, i.e. CR4 loads and stores
> need to be
> intercepted, and so having KVM_{G,S}ET_SREGS operate on
> CR4_READ_SHADOW for VMX
> would yield different ABI for VMX versus SVM.
>
> Note, what L2 *sees* is not a combination of the above; what L2 sees
> is purely
> CR4_READ_SHADOW.  The other fields are consulted only if L2 attempts
> to load CR4.
>
> > If this is expected, can you please explain the reasoning behind
> > this
> > interface decision? For me, it does not make sense that writing
> > back
> > the same value we receive at exit time causes a change in what L2
> > sees
> > for CR4.
>
> I doubt there was ever a concious decision, rather it never came up
> and thus the
> code is the result of doing nothing when nested VMX support was
> added.
>
> That said, KVM's behavior is probably the least awful choice.  The
> changelog of
> the proposed patch is wrong when it says:
>
>   If the signal is meant to be delivered to the L0 VMM, and L0
> updates CR4 for L1
>
> because the update isn't for L1, it's for the active vCPU state,
> which is L2.
>
> At first glance, skipping the vmcs02.CR4_READ_SHADOW seems to make
> sense, but it
> would create a bizarre inconsistency as KVM_SET_SREGS would
> effectively override
> vmcs12->guest_cr4, but not vmcs12->cr4_read_shadow.  KVM doesn't know
> the intent
> of userspace, i.e. KVM can't know if userspace wants to change just
> the effective
> value for CR4, or if userspace wants to change the effective *and*
> observable
> value for CR4.
>
> In your case, where writing CR4 is spurious, preserving the read
> shadow works,
> but if there were some use case where userspace actually wanted to
> change L2's
> CR4, leaving the read shadow set to vmcs12 would be wrong.
>
> The whole situation is rather nonsensical, because if userspace
> actually did change
> CR4, the changes would be lost on the next nested VM-Exit => VM-
> Entry.  That could
> be solved by writing to vmcs12, but that creates a headache of its
> own because then
> userspace changes to L2 become visible to L1, without userspace
> explicitly requesting
> that.
>
> Unfortunately, simply disallowing state save/restore when L2 is
> active doesn't
> work either, because userspace needs to be able to save/restore state
> that _isn't_
> context switched by hardware, i.e. isn't in the VMCS or VMCB.
>
> In short, yes, it's goofy and annoying, but there's no great solution
> and the
> issue really does need to be solved/avoided in userspace
>
> > Another question is: when we want to save the VM state during a
> > savevm/loadvm cycle, we kick all vCPUs via a singal and save their
> > state. If any vCPU runs in L2 at the time of the exit, we somehow
> > need
> > to let it continue to run until we get an exit with the L1 state.
> > Is
> > there a mechanism to help us here?
>
> Hmm, no?  What is it you're trying to do, i.e. why are you doing
> save/load?  If
> you really want to save/load _all_ state, the right thing to do is to
> also save
> and load nested state.
>

You are right. After your pointers and looking at the nesting code
again, I think I know what to do. Just to make sure I understand this
correctly: 

If L0 exits with L2 state, KVM_GET_NESTED_STATE will have
KVM_STATE_NESTED_RUN_PENDING set in the flags field. So when we restore
the vCPU state after a vmsave/vmload cycle, we don't need to update
anything in kvm_run.s.regs because KVM will enter the L2 immediately.
Is that correct?