Re: [PATCH 1/2] KVM: nVMX: fix CR4_READ_SHADOW when L0 updates CR4 during a signal

From: Sean Christopherson
Date: Wed Apr 17 2024 - 12:13:23 EST


On Wed, Apr 17, 2024, Thomas Prescher wrote:
> On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote:
> > Hur dur, I forgot that KVM provides a "guest_mode" stat.  Userspace can do
> > KVM_GET_STATS_FD on the vCPU FD to get a file handle to the binary stats,
> > and then you wouldn't need to call back into KVM just to query guest_mode.
> >
> > Ah, and I also forgot that we have kvm_run.flags, so adding
> > KVM_RUN_X86_GUEST_MODE would also be trivial (I almost suggested it
> > earlier, but didn't want to add a new field to kvm_run without a very good
> > reason).
>
> Thanks for the pointers. This is really helpful.
>
> I tried the "guest_mode" stat as you suggested and it solves the
> immediate issue we have with VirtualBox/KVM.

Note,

> What I don't understand is that we do not get the effective CR4 value
> of the L2 guest in kvm_run.s.regs.sregs.cr4.

Because what you're asking for is *not* the effective CR4 value of L2.

E.g. if L1 is using legacy shadowing paging to run L2, L1 is likely going to run
L2 with GUEST_CR0.PG=1, GUEST_CR4.PAE=1, and GUEST_CR4.PSE=0 (though PSE is largely
irrelevant), i.e. will either use PAE paging or 64-bit paging to shadow L2.

But L2 itself could be unpaged (CR0.PG=0, CR4.PAE=x, CR4.PSE=x), using 32-bit
paging (CR0.PG=1, CR4.PAE=0, CR4.PSE=0), or using 32-bit paging with 4MiB hugepages
(CR0.PG=1, CR4.PAE=0, CR4.PSE=1). In all of those cases, the effective CR0 and CR4
values consumed by hardware are CR0.PG=1, CR4.PAE=1, and CR4.PSE.

Or to convolute things even further, if L0 is running L1 with shadowing paging,
and L1 is running L2 with shadow paging but doing something weird and using PSE
paging, then it would be possible to end up with:

vmcs12->guest_cr4:
.pae = 0
.pse = 1

vmcs12->cr4_read_shadow:
.pae = 0
.pse = 0

vmcs02->guest_cr4:
.pae = 1
.pse = 0

> Instead, userland sees the contents of Vmcs::GUEST_CR4. Shouldn't this be the
> combination of GUEST_CR4, GUEST_CR4_MASK and CR4_READ_SHADOW, i.e. what L2
> actually sees as CR4 value?

Because KVM_{G,S}ET_SREGS (and all other uAPIs in that vein) are defined to operate
on actual vCPU state, and having them do something different if the vCPU is in guest
mode would confusing/odd, and nonsensical to differences between VMX and SVM.

SVM doesn't have per-bit CR0/CR4 controls, i.e. CR4 loads and stores need to be
intercepted, and so having KVM_{G,S}ET_SREGS operate on CR4_READ_SHADOW for VMX
would yield different ABI for VMX versus SVM.

Note, what L2 *sees* is not a combination of the above; what L2 sees is purely
CR4_READ_SHADOW. The other fields are consulted only if L2 attempts to load CR4.

> If this is expected, can you please explain the reasoning behind this
> interface decision? For me, it does not make sense that writing back
> the same value we receive at exit time causes a change in what L2 sees
> for CR4.

I doubt there was ever a concious decision, rather it never came up and thus the
code is the result of doing nothing when nested VMX support was added.

That said, KVM's behavior is probably the least awful choice. The changelog of
the proposed patch is wrong when it says:

If the signal is meant to be delivered to the L0 VMM, and L0 updates CR4 for L1

because the update isn't for L1, it's for the active vCPU state, which is L2.

At first glance, skipping the vmcs02.CR4_READ_SHADOW seems to make sense, but it
would create a bizarre inconsistency as KVM_SET_SREGS would effectively override
vmcs12->guest_cr4, but not vmcs12->cr4_read_shadow. KVM doesn't know the intent
of userspace, i.e. KVM can't know if userspace wants to change just the effective
value for CR4, or if userspace wants to change the effective *and* observable
value for CR4.

In your case, where writing CR4 is spurious, preserving the read shadow works,
but if there were some use case where userspace actually wanted to change L2's
CR4, leaving the read shadow set to vmcs12 would be wrong.

The whole situation is rather nonsensical, because if userspace actually did change
CR4, the changes would be lost on the next nested VM-Exit => VM-Entry. That could
be solved by writing to vmcs12, but that creates a headache of its own because then
userspace changes to L2 become visible to L1, without userspace explicitly requesting
that.

Unfortunately, simply disallowing state save/restore when L2 is active doesn't
work either, because userspace needs to be able to save/restore state that _isn't_
context switched by hardware, i.e. isn't in the VMCS or VMCB.

In short, yes, it's goofy and annoying, but there's no great solution and the
issue really does need to be solved/avoided in userspace

> Another question is: when we want to save the VM state during a
> savevm/loadvm cycle, we kick all vCPUs via a singal and save their
> state. If any vCPU runs in L2 at the time of the exit, we somehow need
> to let it continue to run until we get an exit with the L1 state. Is
> there a mechanism to help us here?

Hmm, no? What is it you're trying to do, i.e. why are you doing save/load? If
you really want to save/load _all_ state, the right thing to do is to also save
and load nested state.