Re: [PATCH 2/2] VMX: nSVM: enter protected mode prior to returning to nested guest from SMM

From: Maxim Levitsky
Date: Mon Aug 30 2021 - 08:45:45 EST


On Thu, 2021-08-26 at 16:23 +0000, Sean Christopherson wrote:
> On Thu, Aug 26, 2021, Maxim Levitsky wrote:
> > SMM return code switches CPU to real mode, and
> > then the nested_vmx_enter_non_root_mode first switches to vmcs02,
> > and then restores CR0 in the KVM register cache.
> >
> > Unfortunately when it restores the CR0, this enables the protection mode
> > which leads us to "restore" the segment registers from
> > "real mode segment cache", which is not up to date vs L2 and trips
> > 'vmx_guest_state_valid check' later, when the
> > unrestricted guest mode is not enabled.
>
> I suspect this is slightly inaccurate. When loading vmcs02, vmx_switch_vmcs()
> will do vmx_register_cache_reset(), which also causes the segment cache to be
> reset. enter_pmode() will still load stale values, but they'll come from vmcs02,
> not KVM's segment register cache.
>
> > This happens to work otherwise, because after we enter the nested guest,
> > we restore its register state again from SMRAM with correct values
> > and that includes the segment values.
> >
> > As a workaround to this if we enter protected mode first,
> > then setting CR0 won't cause this damage.
> >
> > Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 0c2c0d5ae873..805c415494cf 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7507,6 +7507,13 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
> > }
> >
> > if (vmx->nested.smm.guest_mode) {
> > +
> > + /*
> > + * Enter protected mode to avoid clobbering L2's segment
> > + * registers during nested guest entry
> > + */
> > + vmx_set_cr0(vcpu, vcpu->arch.cr0 | X86_CR0_PE);
>
> I'd really, really, reaaaally like to avoid stuffing state. All of the instances
> I've come across where KVM has stuffed state for something like this were just
> papering over one symptom of an underlying bug.

I couldn't agree more with you on this. I even called this patch a hack in the cover letter,
because I didn't like it either.


>
> For example, won't this now cause the same bad behavior if L2 is in Real Mode?
>
> Is the problem purely that emulation_required is stale? If so, how is it stale?
> Every segment write as part of RSM emulation should reevaluate emulation_required
> via vmx_set_segment().

So this is what is happening:

1. RSM emulation switches the vCPU from 64-bit protected mode (which the BIOS SMM handler
of course switches to) to real mode via a CR0 write.

Here 'enter_rmode' is called, which saves the current segment register values in the 'real mode segment cache'
and then fixes up the values in the VMCS so that they 'work' in vm86 mode. The saved architectural values in that 'cache'
are then used whenever the segments are read (e.g. via vmx_get_segment).

2. vmx_leave_smm is called, which calls nested_vmx_enter_non_root_mode.
Unusually, this is done in real mode, whereas otherwise VMX non-root mode entry is
only possible from protected mode (all VMX instructions #UD in real mode).

3. The first thing nested_vmx_enter_non_root_mode does is switch to vmcs02 via vmx_switch_vmcs,
which 'loads' the L2 segments: it zeroes the segment cache (via vmx_register_cache_reset),
so any attempt to read them will read them from vmcs02.

That means that at this point all the good segment values are loaded.

4. Now prepare_vmcs02 is called, which eventually sets KVM's CR0 using 'vmx_set_cr0'.

At that point the function notices that we are entering protected mode, and thus
enter_pmode is called. It first reads the segment values from the real mode segment
cache (which, sadly, reflect the change to CS that the RSM emulation made), updates their bases and selectors
but not the segment types, and writes these segments back, corrupting the L2 state.

The code is:

vmx_get_segment(vcpu, &vmx->rmode.segs[VCPU_SREG_CS], VCPU_SREG_CS); // reads segment cache since vmx->rmode.vm86_active = 1;
...
vmx->rmode.vm86_active = 0;
...
fix_pmode_seg(vcpu, VCPU_SREG_CS, &vmx->rmode.segs[VCPU_SREG_CS]):
__vmx_set_segment(vcpu, save, seg);


My hack was to avoid all this by entering protected mode first and then doing the nested
entry, which is more natural, as I said above.
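
To make the sequence in steps 1-4 easier to follow, here is a small user-space toy model of it.
This is only a sketch and not kernel code: every type and function name below is made up for
illustration, only CS is tracked, and the selector values are arbitrary. It assumes, as described
above, that segment reads are served from the real mode segment cache while vm86 is active.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code: all names and values here are hypothetical. */
struct toy_seg {
	uint16_t selector;
	uint64_t base;
};

struct toy_vcpu {
	struct toy_seg vmcs01_cs;   /* CS field of vmcs01 */
	struct toy_seg vmcs02_cs;   /* CS field of vmcs02 */
	struct toy_seg *loaded_cs;  /* CS field of the currently loaded VMCS */
	struct toy_seg rmode_cs;    /* stands in for the real mode segment cache */
	bool vm86_active;
};

static struct toy_seg toy_get_cs(const struct toy_vcpu *v)
{
	/* While vm86 is active, reads come from the real mode cache. */
	return v->vm86_active ? v->rmode_cs : *v->loaded_cs;
}

static void toy_enter_rmode(struct toy_vcpu *v)
{
	v->rmode_cs = *v->loaded_cs;  /* save the architectural value */
	v->vm86_active = true;
}

static void toy_enter_pmode(struct toy_vcpu *v)
{
	struct toy_seg cs = toy_get_cs(v);  /* still the cached value */

	v->vm86_active = false;
	*v->loaded_cs = cs;  /* written into whichever VMCS is loaded now */
}

/* Returns L2's CS selector after the modeled return from SMM. */
uint16_t toy_rsm_to_l2(bool enter_pmode_first)
{
	struct toy_vcpu v = {
		.vmcs01_cs = { .selector = 0x3000, .base = 0x30000 },  /* SMM-time CS */
		.vmcs02_cs = { .selector = 0x10, .base = 0 },          /* L2's CS */
	};

	v.loaded_cs = &v.vmcs01_cs;
	toy_enter_rmode(&v);              /* step 1: CR0 write drops to real mode */
	if (enter_pmode_first)
		toy_enter_pmode(&v);      /* the hack: leave real mode on vmcs01 */
	v.loaded_cs = &v.vmcs02_cs;       /* step 3: switch to vmcs02 */
	if (v.vm86_active)
		toy_enter_pmode(&v);      /* step 4: CR0.PE restored for L2 */
	return v.vmcs02_cs.selector;
}
```

In this model, toy_rsm_to_l2(false) ends with the SMM-time selector written into the vmcs02 field,
while toy_rsm_to_l2(true) leaves L2's selector untouched, because the write-back lands in vmcs01.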


>
> Oooooh, or are you talking about the explicit vmx_guest_state_valid() in prepare_vmcs02()?
> If that's the case, then we likely should skip that check entirely. The only part
> I'm not 100% clear on is whether or not it can/should be skipped for vmx_set_nested_state().

Yes. In the first version of the patches (which I didn't post) I indeed just removed this check, and it
works, sans another fix which is correct to have anyway
(see the note below).

L2 will briefly have invalid state, which is then fixed by loading the registers from SMRAM.


For vmx_set_nested_state I suspect something similar can happen, at least in theory:
we load the nested state, then restore the registers, and only then does the state become valid.

So it makes sense to remove this check for all but the from_vmentry == true case.

However, we do need to extend the check in vmx_vcpu_run so that if the guest state is not valid and we
are nested, we fail instead of emulating.
I'll do this.
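
The two decisions above (validate the state only on a real VM entry, and refuse to emulate while
nested) can be sketched as plain predicates. This is a toy model under my own naming, not the
actual patch: the enum, both function names, and the boolean inputs are hypothetical stand-ins
for the real checks.

```c
#include <stdbool.h>

/* Toy model, not kernel code: all names below are hypothetical. */
enum toy_run_action {
	TOY_ENTER_GUEST,  /* state is valid, run the guest on hardware */
	TOY_EMULATE,      /* invalid non-nested state, use the emulator */
	TOY_FAIL,         /* invalid state while nested, give up */
};

/* Should the nested entry path validate the guest state? */
bool toy_should_check_guest_state(bool from_vmentry)
{
	/*
	 * Only a real VM entry from L1 has to start with valid state. On RSM
	 * or nested state restore the registers are loaded afterwards.
	 */
	return from_vmentry;
}

/* What should the run loop do for the given state? */
enum toy_run_action toy_vcpu_run(bool guest_state_valid, bool is_nested)
{
	if (guest_state_valid)
		return TOY_ENTER_GUEST;
	/* Emulating L2 with invalid state is not supported in this model. */
	return is_nested ? TOY_FAIL : TOY_EMULATE;
}
```

The point of the second function is that once the early check is relaxed, invalid L2 state
that never gets fixed up still has to be caught somewhere before the guest runs.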


NOTE: There is another fix that has to be done if I remove the
check for validity of the nested state in nested_vmx_enter_non_root_mode instead of using
the hack that stuffs protected mode state:

This is what is happening:

1. RSM emulation switches the vCPU (that is, vmcs01) to real mode, and this state is left in vmcs01.
This means that now the L1 state is not valid either!
(With my hack that switches the vCPU to protected mode, this accidentally doesn't happen.)


2. We switch to vmcs02. The L2 state is temporarily invalid, as it has protected mode segments while in real mode.

3. RSM emulation loads the L2 registers from SMRAM, which makes the L2 state valid again.

4. We (optionally) enter L2.

5. We exit to L1. The L1 guest state is real mode, and is now invalid.

We overwrite L1's guest state with the vmcs12 host state, which is *valid*. However,
'load_vmcs12_host_state' uses __vmx_set_segment, which doesn't update
'emulation_required', so the L1 state is still considered invalid;
we try to emulate it and eventually crash, as the emulator can't really emulate everything.
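
The stale flag can be shown with a minimal toy model. Again this is a sketch with made-up names,
not kernel code: it assumes only what is described above, namely that one setter re-evaluates the
cached 'emulation_required' decision and the raw one does not.

```c
#include <stdbool.h>

/* Toy model, not kernel code: all names below are hypothetical. */
struct toy_vmx {
	bool segs_valid;          /* would the segment state pass validation? */
	bool emulation_required;  /* cached decision consulted before running */
};

/* Models the raw setter: writes the segment, leaves the flag alone. */
static void toy_raw_set_segment(struct toy_vmx *vmx, bool valid)
{
	vmx->segs_valid = valid;
}

/* Models the full setter: writes the segment and refreshes the flag. */
static void toy_set_segment(struct toy_vmx *vmx, bool valid)
{
	toy_raw_set_segment(vmx, valid);
	vmx->emulation_required = !vmx->segs_valid;
}

/* Returns L1's emulation_required after the modeled exit from L2. */
bool toy_l1_needs_emulation_after_exit(bool use_full_setter)
{
	struct toy_vmx l1 = { .segs_valid = true, .emulation_required = false };

	/* Step 1: RSM emulation leaves invalid real mode state behind. */
	toy_set_segment(&l1, false);

	/* Step 5: valid host state is loaded on the exit to L1. */
	if (use_full_setter)
		toy_set_segment(&l1, true);
	else
		toy_raw_set_segment(&l1, true);

	return l1.emulation_required;
}
```

With the raw setter the segments end up valid but the flag stays set, which is the stale state
described above; with the full setter the flag is cleared together with the segment write.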

I am now posting a new version of my SMM fixes with the title '[PATCH v2 0/6] KVM: few more SMM fixes'
(I merged the SVM and VMX fixes into a single patch series), and I include all of the above there.

Thanks again for the review!

Best regards,
Maxim Levitsky


>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index bc6327950657..20bd84554c1f 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2547,7 +2547,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> * which means L1 attempted VMEntry to L2 with invalid state.
> * Fail the VMEntry.
> */
> - if (CC(!vmx_guest_state_valid(vcpu))) {
> + if (from_vmentry && CC(!vmx_guest_state_valid(vcpu))) {
> *entry_failure_code = ENTRY_FAIL_DEFAULT;
> return -EINVAL;
> }
>
>
> If we want to retain the check for the common vmx_set_nested_state() path, i.e.
> when the vCPU is truly being restored to guest mode, then we can simply exempt
> the smm.guest_mode case (which also exempts that case when it's set via
> vmx_set_nested_state()). The argument would be that RSM is going to restore L2
> state, so whatever happens to be in vmcs12/vmcs02 is stale.
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index bc6327950657..ac30ba6a8592 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2547,7 +2547,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> * which means L1 attempted VMEntry to L2 with invalid state.
> * Fail the VMEntry.
> */
> - if (CC(!vmx_guest_state_valid(vcpu))) {
> + if (!vmx->nested.smm.guest_mode && CC(!vmx_guest_state_valid(vcpu))) {
> *entry_failure_code = ENTRY_FAIL_DEFAULT;
> return -EINVAL;
> }
>