Re: [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests

From: Tom Lendacky
Date: Mon May 10 2021 - 17:23:46 EST


On 5/10/21 4:02 PM, Sean Christopherson wrote:
> On Mon, May 10, 2021, Tom Lendacky wrote:
>> On 5/10/21 11:10 AM, Sean Christopherson wrote:
>>> On Fri, May 07, 2021, Tom Lendacky wrote:
>>>> On 5/7/21 11:59 AM, Sean Christopherson wrote:
>>>>> Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
>>>>> protected guests, e.g. for SEV-ES guests with an encrypted VMSA. KVM
>>>>> tracks the aforementioned registers by trapping guest writes, and also
>>>>> exposes the values to userspace via KVM_GET_SREGS. Skipping the regs
>>>>> in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
>>>>> match the known hardware state.
>>>>
>>>> This is very similar to the original patch I had proposed that you were
>>>> against :)
>>>
>>> I hope/think my position was that it should be unnecessary for KVM to need to
>>> know the guest's CR0/4/8 and EFER values, i.e. even the trapping is unnecessary.
>>> I was going to say I had a change of heart, as EFER.LMA in particular could
>>> still be required to identify 64-bit mode, but that's wrong; EFER.LMA only gets
>>> us long mode, the full is_64_bit_mode() needs access to cs.L, which AFAICT isn't
>>> provided by #VMGEXIT or trapping.
>>
>> Right, that one is missing. If you take a VMGEXIT that uses the GHCB, then
>> I think you can assume we're in 64-bit mode.
>
> But that's not technically guaranteed. The GHCB even seems to imply that there
> are scenarios where it's legal/expected to do VMGEXIT with a valid GHCB outside
> of 64-bit mode:
>
> However, instead of issuing a HLT instruction, the AP will issue a VMGEXIT
> with SW_EXITCODE of 0x8000_0004 (this implies that the GHCB was updated prior
> to leaving 64-bit long mode).

Right, but in order to fill in the GHCB so that the hypervisor can read
it, the guest had to have been in 64-bit mode. Otherwise, whatever the
guest wrote will be seen as encrypted data and make no sense to the
hypervisor anyway.

>
> In practice, assuming the guest is in 64-bit mode will likely work, especially
> since the MSR-based protocol is extremely limited, but ideally there should be
> stronger language in the GHCB to define the exact VMM assumptions/behaviors.
>
> On the flip side, that assumption and the limited exposure through the MSR
> protocol means trapping CR0, CR4, and EFER is pointless. I don't see how KVM
> can do anything useful with that information outside of VMGEXITs. Page tables
> are encrypted and GPRs are stale; what else could KVM possibly do with
> identifying protected mode, paging, and/or 64-bit?
>
>>> Unless I'm missing something, that means that VMGEXIT(VMMCALL) is broken since
>>> KVM will incorrectly crush (or preserve) bits 63:32 of GPRs. I'm guessing no
>>> one has reported a bug because either (a) no one has tested a hypercall that
>>> requires bits 63:32 in a GPR or (b) the guest just happens to be in 64-bit mode
>>> when KVM_SEV_LAUNCH_UPDATE_VMSA is invoked and so the segment registers are
>>> frozen to make it appear as if the guest is perpetually in 64-bit mode.
>>
>> I don't think it's (b) since the LAUNCH_UPDATE_VMSA is done against reset-
>> state vCPUs.
>>
>>>
>>> I see that sev_es_validate_vmgexit() checks ghcb_cpl_is_valid(), but isn't that
>>> either pointless or indicative of a much, much bigger problem? If VMGEXIT is
>>
>> It is needed for the VMMCALL exit.
>>
>>> restricted to CPL0, then the check is pointless. If VMGEXIT isn't restricted to
>>> CPL0, then KVM has a big gaping hole that allows a malicious/broken guest
>>> userspace to crash the VM simply by executing VMGEXIT. Since valid_bitmap is
>>> cleared during VMGEXIT handling, I don't think guest userspace can attack/corrupt
>>> the guest kernel by doing a replay attack, but it does all but guarantee a
>>> VMGEXIT at CPL>0 will be fatal since the required valid bits won't be set.
>>
>> Right, so I think some cleanup is needed there, both for the guest and the
>> hypervisor:
>>
>> - For the guest, we could just clear the valid bitmask before leaving the
>> #VC handler/releasing the GHCB. Userspace can't update the GHCB, so any
>> VMGEXIT from userspace would just look like a no-op with the below
>> change to KVM.
>
> Ah, right, the exit_code and exit infos need to be valid.
>
>> - For KVM, instead of returning -EINVAL from sev_es_validate_vmgexit(), we
>> return the #GP action through the GHCB and continue running the guest.
>
> Agreed, KVM should never kill the guest in response to a bad VMGEXIT. That
> should always be a guest decision.
>
>>> Sadly, the APM doesn't describe the VMGEXIT behavior, nor does any of the SEV-ES
>>> documentation I have. I assume VMGEXIT is recognized at CPL>0 since it morphs
>>> to VMMCALL when SEV-ES isn't active.
>>
>> Correct.
>>
>>>
>>> I.e. either the ghcb_cpl_is_valid() check should be nuked, or more likely KVM
>>
>> The ghcb_cpl_is_valid() is still needed to see whether the VMMCALL was
>> from userspace or not (a VMMCALL will generate a #VC).
>
> Blech. I get that the GHCB spec says CPL must be provided/checked for VMMCALL,
> but IMO that makes no sense whatsoever.
>
> If the guest restricts the GHCB to CPL0, then the CPL field is pointless because
> the VMGEXIT will only ever come from CPL0. Yes, technically the guest kernel
> can proxy a VMMCALL from userspace to the host, but the guest kernel _must_ be
> the one to enforce any desired CPL checks because the VMM is untrusted, at least
> once you get to SNP.
>
> If the guest exposes the GHCB to any CPL, then the CPL check is worthless because

The GHCB itself is not exposed to any CPL. A VMMCALL will generate a #VC.
The guest #VC handler will extract the CPL from the context that
generated the #VC (see vc_handle_vmmcall() in arch/x86/kernel/sev-es.c),
so that a VMMCALL from userspace will have the proper CPL value in the
GHCB when the #VC handler issues the VMGEXIT instruction.
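
As a rough illustration of that flow, here is a simplified, self-contained
sketch. It is not the kernel code: struct fake_regs, struct fake_ghcb and
every fake_* helper are made-up stand-ins, and deriving the CPL from the low
two bits of the saved CS selector is an assumption about how a handler could
do it, not a statement of what vc_handle_vmmcall() actually contains.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the kernel's struct pt_regs or struct ghcb. */
struct fake_regs {
	uint64_t ax;	/* RAX of the context that executed VMMCALL */
	uint16_t cs;	/* CS selector saved when the #VC was taken */
};

struct fake_ghcb {
	uint64_t rax;
	uint8_t  cpl;
	bool     rax_valid;	/* models the RAX bit in the valid bitmap */
	bool     cpl_valid;	/* models the CPL bit in the valid bitmap */
};

/*
 * Assumption: the RPL bits of the saved CS selector reflect the privilege
 * level of the interrupted context (0 for kernel, 3 for userspace).
 */
static inline uint8_t fake_regs_cpl(const struct fake_regs *regs)
{
	return (uint8_t)(regs->cs & 0x3);
}

/* Guest #VC handler side: describe the VMMCALL in the GHCB before VMGEXIT. */
static inline void fake_vc_prepare_vmmcall(struct fake_ghcb *ghcb,
					   const struct fake_regs *regs)
{
	ghcb->rax = regs->ax;
	ghcb->rax_valid = true;
	ghcb->cpl = fake_regs_cpl(regs);
	ghcb->cpl_valid = true;
}

/* Guest side: clear the valid bits before releasing the GHCB. */
static inline void fake_ghcb_invalidate(struct fake_ghcb *ghcb)
{
	ghcb->rax_valid = false;
	ghcb->cpl_valid = false;
}

/* Hypervisor side: a VMMCALL exit is only usable if both fields are valid. */
static inline bool fake_vmmcall_fields_valid(const struct fake_ghcb *ghcb)
{
	return ghcb->rax_valid && ghcb->cpl_valid;
}
```

The point of the sketch is that the CPL value is written by the guest kernel
from the saved exception context, never by guest userspace, and that a GHCB
whose valid bits have been cleared no longer passes the hypervisor-side check.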

Thanks,
Tom

> guest userspace can simply lie about the CPL. And exposing the GHCB to userspace
> completely undermines guest privilege separation since hardware doesn't provide
> the real CPL, i.e. the VMM, even if it were trusted, can't determine the origin of
> the VMGEXIT.
>