Re: [PATCH 1/2] KVM: nVMX: fix CR4_READ_SHADOW when L0 updates CR4 during a signal

From: Thomas Prescher
Date: Tue Apr 16 2024 - 13:32:05 EST


On Tue, 2024-04-16 at 08:17 -0700, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, Thomas Prescher wrote:
> > Hi Sean,
> >
> > On Tue, 2024-04-16 at 07:35 -0700, Sean Christopherson wrote:
> > > On Tue, Apr 16, 2024, Julian Stecklina wrote:
> > > > From: Thomas Prescher <thomas.prescher@xxxxxxxxxxxxxxxxxxxxx>
> > > >
> > > > This issue occurs when the kernel is interrupted by a signal
> > > > while
> > > > running a L2 guest. If the signal is meant to be delivered to
> > > > the L0 VMM,
> > > > and L0 updates CR4 for L1, i.e. when the VMM sets
> > > > KVM_SYNC_X86_SREGS in
> > > > kvm_run->kvm_dirty_regs, the kernel programs an incorrect read
> > > > shadow
> > > > value for L2's CR4.
> > > >
> > > > The result is that the guest can read a value for CR4 where
> > > > bits from L1
> > > > have leaked into L2.
> > >
> > > No, this is a userspace bug.  If L2 is active when userspace
> > > stuffs
> > > register state, then from KVM's perspective the incoming value is
> > > L2's
> > > value.  E.g.  if userspace *wants* to update L2 CR4 for whatever
> > > reason,
> > > this patch would result in L2 getting a stale value, i.e. the
> > > value of CR4
> > > at the time of VM-Enter.
> > >
> > > And even if userspace wants to change L1, this patch is wrong, as
> > > KVM is
> > > writing vmcs02.GUEST_CR4, i.e. is clobbering the L2 CR4 that was
> > > programmed
> > > by L1, *and* is dropping the CR4 value that userspace wanted to
> > > stuff for
> > > L1.
> > >
> > > To fix this, your userspace needs to either wait until L2 isn't
> > > active, or
> > > force the vCPU out of L2 (which isn't easy, but it's doable if
> > > absolutely
> > > necessary).
> >
> > What you say makes sense. Is there any way for
> > userspace to detect whether L2 is currently active after
> > returning from KVM_RUN? I couldn't find anything in the official
> > documentation https://docs.kernel.org/virt/kvm/api.html
> >
> > Can you point me into the right direction?
>
> Hmm, the only way to query that information is via
> KVM_GET_NESTED_STATE, which is
> a bit unfortunate as that is a fairly "heavy" ioctl().

Indeed. What still does not make sense to me is that userspace should
be able to modify the L2 state. From what I can see, even in this
scenario, L0 exits with the L1 state. So what you are saying is that
userspace should be allowed to change L2 even if it receives the
architectural state from L1? What would be the use-case for this
scenario?

If the above is true, I think we should document this explicitly
because it's not obvious, at least not for me ;)

How would you feel if we extend struct kvm_run with a
nested_guest_active flag? This should be fairly cheap and would make
the integration into VirtualBox/KVM much easier. We could also only
sync this flag explicitly, e.g. with a KVM_SYNC_X86_NESTED_GUEST?