Re: [PATCH 1/2] KVM: nVMX: fix CR4_READ_SHADOW when L0 updates CR4 during a signal

From: Sean Christopherson
Date: Tue Apr 16 2024 - 14:08:06 EST


On Tue, Apr 16, 2024, Thomas Prescher wrote:
> On Tue, 2024-04-16 at 08:17 -0700, Sean Christopherson wrote:
> > On Tue, Apr 16, 2024, Thomas Prescher wrote:
> > > Hi Sean,
> > >
> > > On Tue, 2024-04-16 at 07:35 -0700, Sean Christopherson wrote:
> > > > On Tue, Apr 16, 2024, Julian Stecklina wrote:
> > > > > From: Thomas Prescher <thomas.prescher@xxxxxxxxxxxxxxxxxxxxx>
> > > > >
> > > > > This issue occurs when the kernel is interrupted by a signal while
> > > > > running an L2 guest. If the signal is meant to be delivered to the L0
> > > > > VMM, and L0 updates CR4 for L1, i.e. when the VMM sets
> > > > > KVM_SYNC_X86_SREGS in kvm_run->kvm_dirty_regs, the kernel programs an
> > > > > incorrect read shadow value for L2's CR4.
> > > > >
> > > > > The result is that the guest can read a value for CR4 where bits from
> > > > > L1 have leaked into L2.
> > > >
> > > > No, this is a userspace bug.  If L2 is active when userspace stuffs
> > > > register state, then from KVM's perspective the incoming value is L2's
> > > > value.  E.g.  if userspace *wants* to update L2 CR4 for whatever
> > > > reason, this patch would result in L2 getting a stale value, i.e. the
> > > > value of CR4 at the time of VM-Enter.
> > > >
> > > > And even if userspace wants to change L1, this patch is wrong, as KVM
> > > > is writing vmcs02.GUEST_CR4, i.e. is clobbering the L2 CR4 that was
> > > > programmed by L1, *and* is dropping the CR4 value that userspace wanted
> > > > to stuff for L1.
> > > >
> > > > To fix this, your userspace needs to either wait until L2 isn't active,
> > > > or force the vCPU out of L2 (which isn't easy, but it's doable if
> > > > absolutely necessary).
> > >
> > > What you say makes sense. Is there any way for
> > > userspace to detect whether L2 is currently active after
> > > returning from KVM_RUN? I couldn't find anything in the official
> > > documentation https://docs.kernel.org/virt/kvm/api.html
> > >
> > > Can you point me in the right direction?
> >
> > Hmm, the only way to query that information is via KVM_GET_NESTED_STATE,
> > which is a bit unfortunate as that is a fairly "heavy" ioctl().

Hur dur, I forgot that KVM provides a "guest_mode" stat. Userspace can do
KVM_GET_STATS_FD on the vCPU FD to get a file handle to the binary stats, and
then there's no need to call back into KVM just to query guest_mode.
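
E.g. something like this completely untested sketch, which just greps the
binary stats for "guest_mode" (struct layouts per <linux/kvm.h>; a real VMM
would want to cache the stats fd and the stat's offset instead of re-walking
the descriptors on every exit):

    #include <linux/kvm.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /*
     * Return 1 if the vCPU is in guest mode (L2 active), 0 if not, -1
     * on error.
     */
    static int vcpu_in_guest_mode(int vcpu_fd)
    {
        struct kvm_stats_header hdr;
        int ret = -1;
        uint32_t i;

        int stats_fd = ioctl(vcpu_fd, KVM_GET_STATS_FD, NULL);
        if (stats_fd < 0)
            return -1;

        if (pread(stats_fd, &hdr, sizeof(hdr), 0) != sizeof(hdr))
            goto out;

        /* Each descriptor is followed by name_size bytes of stat name. */
        for (i = 0; i < hdr.num_desc; i++) {
            char buf[sizeof(struct kvm_stats_desc) + hdr.name_size];
            struct kvm_stats_desc *desc = (void *)buf;
            uint64_t val;

            if (pread(stats_fd, buf, sizeof(buf),
                      hdr.desc_offset + i * sizeof(buf)) != sizeof(buf))
                goto out;
            if (strcmp(desc->name, "guest_mode"))
                continue;

            /* The stat's data lives at data_offset + the per-stat offset. */
            if (pread(stats_fd, &val, sizeof(val),
                      hdr.data_offset + desc->offset) != sizeof(val))
                goto out;
            ret = val != 0;
            break;
        }
    out:
        close(stats_fd);
        return ret;
    }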

Ah, and I also forgot that we have kvm_run.flags, so adding KVM_RUN_X86_GUEST_MODE
would also be trivial (I almost suggested it earlier, but didn't want to add a
new field to kvm_run without a very good reason).
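
To be clear, that flag doesn't exist today; if it were added, the userspace
side would boil down to a simple flags check, e.g. (hypothetical bit value,
KVM_RUN_X86_SMM already claims bit 0):

    #include <linux/kvm.h>
    #include <stdbool.h>
    #include <sys/ioctl.h>

    /* Hypothetical, not in the uAPI (yet). */
    #ifndef KVM_RUN_X86_GUEST_MODE
    #define KVM_RUN_X86_GUEST_MODE (1 << 1)
    #endif

    /* Run the vCPU once and report whether it exited with L2 active. */
    static int run_vcpu(int vcpu_fd, struct kvm_run *run, bool *l2_active)
    {
        int ret = ioctl(vcpu_fd, KVM_RUN, 0);

        *l2_active = !!(run->flags & KVM_RUN_X86_GUEST_MODE);
        return ret;
    }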

> Indeed. What still does not make sense to me is that userspace should be able
> to modify the L2 state. From what I can see, even in this scenario, L0 exits
> with the L1 state.

No, KVM exits with L2 state. Or at least, KVM is supposed to exit with L2
state. Exiting with L1 state would actually be quite difficult, as KVM would
need to manually switch to vmcs01, pull out the state, and then switch back to
vmcs02. And for GPRs, KVM doesn't have L1 state at all, because most GPRs are
"volatile", i.e. are clobbered by VM-Enter and thus need to be manually managed
by the hypervisor.

I assume you're using kvm_run.kvm_valid_regs to instruct KVM to save vCPU state
when exiting to userspace? That path grabs state straight from the vCPU, without
regard to is_guest_mode(). If you're seeing L1 state, then there's a bug
somewhere, likely in userspace again. While kvm_run.kvm_valid_regs might not
be heavily used, the underlying code is very heavily used for doing
save/restore while L2 is active, e.g. for live migration.
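
FWIW, the flow I'm assuming you're using looks roughly like this ('run' being
the mmap()ed kvm_run struct; untested sketch):

    #include <linux/kvm.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    /*
     * Ask KVM to save sregs on exit, then read back CR4.  If L2 is
     * active at the time of the exit, the CR4 reported here is L2's,
     * not L1's.
     */
    static uint64_t run_and_read_cr4(int vcpu_fd, struct kvm_run *run)
    {
        run->kvm_valid_regs = KVM_SYNC_X86_SREGS;

        ioctl(vcpu_fd, KVM_RUN, 0);        /* error handling elided */

        return run->s.regs.sregs.cr4;
    }

    /*
     * Conversely, state marked dirty is loaded on the next KVM_RUN,
     * and if L2 is active it lands in *L2*, which is what bit the VMM
     * here.
     */
    static void stuff_cr4(struct kvm_run *run, uint64_t cr4)
    {
        run->s.regs.sregs.cr4 = cr4;
        run->kvm_dirty_regs = KVM_SYNC_X86_SREGS;
    }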

> So what you are saying is that userspace should be allowed to change L2 even
> if it receives the architectural state from L1?

No, see above.

> What would be the use-case for this scenario?
>
> If the above is true, I think we should document this explicitly
> because it's not obvious, at least not for me ;)
>
> How would you feel if we extended struct kvm_run with a
> nested_guest_active flag? This should be fairly cheap and would make
> the integration into VirtualBox/KVM much easier. We could also sync
> this flag only explicitly, e.g. with a KVM_SYNC_X86_NESTED_GUEST flag?

Heh, see above regarding KVM_RUN_X86_GUEST_MODE.