Re: [PATCH v3 2/2] VMX: reset the segment cache after segment initialization in vmx_vcpu_reset

From: Maxim Levitsky
Date: Mon Sep 09 2024 - 21:07:52 EST


On Mon, 2024-09-09 at 15:11 -0400, Maxim Levitsky wrote:
> On Fri, 2024-08-09 at 08:27 -0700, Sean Christopherson wrote:
> > On Tue, Aug 06, 2024, Chao Gao wrote:
> > > On Thu, Jul 25, 2024 at 01:52:32PM -0400, Maxim Levitsky wrote:
> > > > Fix this by moving the vmx_segment_cache_clear() call to be after the
> > > > segments are initialized.
> > > >
> > > > Note that this still doesn't fix the issue of kvm_arch_vcpu_in_kernel
> > > > getting stale data during the segment setup, and that issue will
> > > > be addressed later.
> > > >
> > > > Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
> > >
> > > Do you need a Fixes tag and/or Cc: stable?
> >
> > Heh, it's an old one
> >
> > Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields")
> >
> > > > ---
> > > > arch/x86/kvm/vmx/vmx.c | 6 +++---
> > > > 1 file changed, 3 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index fa9f307d9b18..d43bb755e15c 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -4870,9 +4870,6 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > vmx->hv_deadline_tsc = -1;
> > > > kvm_set_cr8(vcpu, 0);
> > > >
> > > > - vmx_segment_cache_clear(vmx);
> > > > - kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > > > -
> > > > seg_setup(VCPU_SREG_CS);
> > > > vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> > > > vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
> > > > @@ -4899,6 +4896,9 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > vmcs_writel(GUEST_IDTR_BASE, 0);
> > > > vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> > > >
> > > > + vmx_segment_cache_clear(vmx);
> > > > + kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
> > >
> > > vmx_segment_cache_clear() is called in a few other sites. I think at least the
> > > call in __vmx_set_segment() should be fixed, because QEMU may read SS.AR right
> > > after a write to it. if the write was preempted after the cache was cleared but
> > > before the new value being written into VMCS, QEMU would find that SS.AR held a
> > > stale value.
> >
> > Ya, I thought the plan was to go for a more complete fix[*]? This change isn't
> > wrong, but it's obviously incomplete, and will be unnecessary if the preemption
> > issue is resolved.
>
> Hi,
>
> I was thinking to keep it simple, since the issue is mostly theoretical after this fix,
> but I'll give this another try.
>
> Best regards,
> Maxim Levitsky
>
> > [*] https://lore.kernel.org/all/f183d215c903d4d1e85bf89e9d8b57dd6ce5c175.camel@xxxxxxxxxx
> >

Hi,

This is what I am thinking, after going over this issue again:

Pre-populating the cache and/or adding 'exited_in_kernel' will waste vmreads on *each* vmexit,
I worry that this is just not worth the mostly theoretical issue that we have.


Since the segment and the register cache only optimize the case of reading a same field twice or more,
I suspect that reading these fields always is worse performance wise than removing the segment cache
altogether and reading these fields again and again.


Finally all 3 places that read the segment cache, only access one piece of data (SS.AR or RIP),
thus it doesn't really matter if they see an old or a new value.

I mean in theory if userspace changes the SS's AR bytes out of the
blue, and then we get a preemption event, in theory as you say the old value is correct but it really
doesn't matter.

So IMHO, just ensuring that we invalidate the segment cache right after we do any changes is the simplest
solution.

I can in addition to that add a warning to kvm_register_is_available and vmx_segment_cache_test_set, that
will test that only SS.AR and RIP are read from the interrupt context, so that if in the future someone
attempts to read more fields, this issue can be re-evaluated.

Best regards,
Maxim Levitsky