Re: [PATCH 2/2] KVM: x86: Allow userspace to update tracked sregs for protected guests
From: Sean Christopherson
Date: Mon May 10 2021 - 18:40:26 EST
On Mon, May 10, 2021, Tom Lendacky wrote:
> On 5/10/21 4:02 PM, Sean Christopherson wrote:
> > On Mon, May 10, 2021, Tom Lendacky wrote:
> >> On 5/10/21 11:10 AM, Sean Christopherson wrote:
> >>> On Fri, May 07, 2021, Tom Lendacky wrote:
> >>>> On 5/7/21 11:59 AM, Sean Christopherson wrote:
> >>>>> Allow userspace to set CR0, CR4, CR8, and EFER via KVM_SET_SREGS for
> >>>>> protected guests, e.g. for SEV-ES guests with an encrypted VMSA. KVM
> >>>>> tracks the aforementioned registers by trapping guest writes, and also
> >>>>> exposes the values to userspace via KVM_GET_SREGS. Skipping the regs
> >>>>> in KVM_SET_SREGS prevents userspace from updating KVM's CPU model to
> >>>>> match the known hardware state.
> >>>>
> >>>> This is very similar to the original patch I had proposed that you were
> >>>> against :)
> >>>
> >>> I hope/think my position was that it should be unnecessary for KVM to need to
> >>> know the guest's CR0/4/0 and EFER values, i.e. even the trapping is unnecessary.
> >>> I was going to say I had a change of heart, as EFER.LMA in particular could
> >>> still be required to identify 64-bit mode, but that's wrong; EFER.LMA only gets
> >>> us long mode, the full is_64_bit_mode() needs access to cs.L, which AFAICT isn't
> >>> provided by #VMGEXIT or trapping.
> >>
> >> Right, that one is missing. If you take a VMGEXIT that uses the GHCB, then
> >> I think you can assume we're in 64-bit mode.
> >
> > But that's not technically guaranteed. The GHCB even seems to imply that there
> > are scenarios where it's legal/expected to do VMGEXIT with a valid GHCB outside
> > of 64-bit mode:
> >
> > However, instead of issuing a HLT instruction, the AP will issue a VMGEXIT
> > with SW_EXITCODE of 0x8000_0004 ((this implies that the GHCB was updated prior
> > to leaving 64-bit long mode).
>
> Right, but in order to fill in the GHCB so that the hypervisor can read
> it, the guest had to have been in 64-bit mode. Otherwise, whatever the
> guest wrote will be seen as encrypted data and make no sense to the
> hypervisor anyway.
Having once been in 64-bit is not the same thing as currently being in 64-bit.
E.g. if the VMM _knows_ that the vCPU is not in 64-bit, e.g. because it traps
writes to CR0, then ignoring bits 63:32 of GPRs would be entirely reasonable.
In practice it probably won't matter since the guest will have to go out of its
way to do VMGEXIT outside of 64-bit mode with valid data in the GHCB, but ideally
KVM will conform to a spec, not implement a best guess as to what will/won't work
for the guest.
> >>> I.e. either the ghcb_cpl_is_valid() check should be nuked, or more likely KVM
> >>
> >> The ghcb_cpl_is_valid() is still needed to see whether the VMMCALL was
> >> from userspace or not (a VMMCALL will generate a #VC).
> >
> > Blech. I get that the GHCB spec says CPL must be provided/checked for VMMCALL,
> > but IMO that makes no sense whatsover.
> >
> > If the guest restricts the GHCB to CPL0, then the CPL field is pointless because
> > the VMGEXIT will only ever come from CPL0. Yes, technically the guest kernel
> > can proxy a VMMCALL from userspace to the host, but the guest kernel _must_ be
> > the one to enforce any desired CPL checks because the VMM is untrusted, at least
> > once you get to SNP.
> >
> > If the guest exposes the GHCB to any CPL, then the CPL check is worthless because
>
> The GHCB itself is not exposed to any CPL.
That is a guest decision. Nothing in the GHCB spec requires the GHCB to be
accessible only in CPL0. I'm not arguing that any sane guest will actually map
the GHCB into userspace, but from an architectural perspective it's not a given.
> A VMMCALL will generate a #VC.
But that's irrevelant from an architectural perspective. It is 100% legal to
do VMGEXIT(VMMCALL) without first doing VMMCALL and taking a #VC.
> The guest #VC handler will extract the CPL level from the context that
> generated the #VC (see vc_handle_vmmcall() in arch/x86/kernel/sev-es.c),
> so that a VMMCALL from userspace will have the proper CPL value in the
> GHCB when the #VC handler issues the VMGEXIT instruction.
I get how the CPL ended up as a requirement for VMMCALL in the GHCB, I'm arguing
that it's useless information. From the guest's perspective, it must handle
privilege checks for those checks to have any meaning with respect to security.
>From the VMM's perspective, the checks have zero meaning whatsoever because the
CPL field is software controlled. The fact that the VMM got a VMGEXIT with
valid data in the GHCB is proof enough that the guest blessed the VMGEXIT.
It's kind of a moot point because the extra CPL doesn't break anything, but it
gives a false sense of security since it implies the guest can/should rely on
the VMM to do privilege enforcement.
Actually, after looking at the Linux guest code, the entire VMCALL->#VC->VMGEXIT
concept is flawed. E.g. kvm_sev_es_hcall_prepare() copies over all _possible_
GPRs. That means users of kvm_hypercall{1,2,3}() are leaking random register
state to the VMM. That can be "fixed" by zeroing unused registers or reflecting
on regs->ax, but at that point it's far easier and more performant to simply do
VMGEXIT(VMCALL) directly and skip the #VC.
As for VMMCALL from userspace, IMO the kernel's default should be to reject them
unconditionally. KVM and Hyper-V APIs do not allow hypercalls from CPL>0. At a
glance, neither does Xen (Xen skips the CPL check, but AFAICT only when the vCPU
is in real mode and thus the CPL check is meaningless). No idea about VMware.
If there is a use case for allowing VMMCALL from userspace, then it absolutely
needs to be handled in a x86_hyper_runtime hook, because otherwise the kernel
has zero clue whether or not the hypercall should be allowed, and if it's
allowed, precisely which registers to expose. I.e. the hook would have to
expose only the registers needed for that exact hypercall, otherwise the kernel
would unintentionally leak userspace register state.