Re: [PATCH v2 7/9] x86: Use current USER_CS to setup correct context on vmx entry

From: Andy Lutomirski
Date: Fri Oct 02 2020 - 20:15:28 EST


On Fri, Oct 2, 2020 at 3:40 PM Sean Christopherson
<sean.j.christopherson@xxxxxxxxx> wrote:
>
> On Thu, Oct 01, 2020 at 02:52:32PM -0700, Andy Lutomirski wrote:
> > On Thu, Oct 1, 2020 at 1:59 PM Gabriel Krisman Bertazi
> > <krisman@xxxxxxxxxxxxx> wrote:
> > >
> > > vmx_prepare_switch_to_guest shouldn't use is_64bit_mm, which has a
> > > very specific use in uprobes. Use the user_64bit_mode helper instead.
> > > This reduces the usage of is_64bit_mm, which is awkward since it relies
> > > on the personality at load time; that is fine for uprobes, but doesn't
> > > seem right here.
> > >
> > > I tested this by running VMs with 64-bit and 32-bit payloads from
> > > 64/32-bit programs.
> > >
> > > Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxx>
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 7b2a068f08c1..b5aafd9e5f5d 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -1172,7 +1172,7 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> > >          savesegment(es, host_state->es_sel);
> > >
> > >          gs_base = cpu_kernelmode_gs_base(cpu);
> > > -        if (likely(is_64bit_mm(current->mm))) {
> > > +        if (likely(user_64bit_mode(current_pt_regs()))) {
> > >                  current_save_fsgs();
> > >                  fs_sel = current->thread.fsindex;
> > >                  gs_sel = current->thread.gsindex;
> >
> > I disagree with this one. This whole code path is nonsense. Can you
> > just remove the condition entirely and use the 64-bit path
> > unconditionally?
>
> I finally came back to this one with fresh eyes. I've read through the code
> a bajillion times and typed up half a dozen responses. I think, finally, I
> understand what's broken.
>
> I believe your original assertion that the bug was misdiagnosed is correct
> (can't link because LKML wasn't in the loop). I'm pretty sure your analysis
> that KVM's handling of things works mostly by coincidence is also correct.
>
> The coincidence is that "real" VMMs all use arch_prctl(), and
> do_arch_prctl_64() ensures thread.{fs,gs}base are accurate. save_base_legacy()
> detects sel==0 and intentionally does nothing, knowing the base is already
> accurate.
>
> Userspaces that don't use arch_prctl(), in the bug report case a 32-bit compat
> test, may or may not have accurate thread.{fs,gs}base values. This is
> especially true if sel!=0, since save_base_legacy() explicitly zeros the base
> in that case and relies on load_seg_legacy() to restore the seg on the back end.
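> 
> Roughly what that pair does (paraphrased sketch, not the verbatim code; see
> save_base_legacy()/load_seg_legacy() in arch/x86/kernel/process_64.c):
> 
>         if (likely(selector == 0)) {
>                 /*
>                  * sel==0: skip the save and trust that thread.{fs,gs}base
>                  * is already accurate, which holds for anything that set
>                  * its base via arch_prctl().
>                  */
>         } else {
>                 /*
>                  * sel!=0: zero the saved base; load_seg_legacy() reloads
>                  * the selector, and thus the base, on switch-in.
>                  */
>                 if (which == FS)
>                         prev_p->thread.fsbase = 0;
>                 else
>                         prev_p->thread.gsbase = 0;
>         }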
>
> KVM on the other hand assumes thread.{fs,gs}base are always fresh. When that
> didn't hold true for userspace that didn't use arch_prctl(), the fix of
> detecting a !64-bit mm just so happened to work because all 64-bit VMMs use
> arch_prctl().
>
> It's tempting to just open code this and use RD{FS,GS}BASE when possible,
> i.e. avoid any guesswork. Maybe with a module param that userspace can set
> to tell KVM it doesn't do anything fancy with FS/GS base (will userspace still
> use arch_prctl() even if FSGSBASE is available?).
>
>         savesegment(fs, fs_sel);
>         savesegment(gs, gs_sel);
>         if (use_current_fsgs_base) {
>                 fs_base = current->thread.fsbase;
>                 vmx->msr_host_kernel_gs_base = current->thread.gsbase;

I don't like this. The FSGSBASE case is fast, and the !FSGSBASE case
is mostly obsolete. I see no great reason to have a module parameter
asking for incorrect behavior. There have been too many bugs in this
area -- let's not add more please.

>         } else if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
>                 fs_base = rdfsbase();
>                 vmx->msr_host_kernel_gs_base = __rdgsbase_inactive();
>         } else {
>                 fs_base = read_msr(MSR_FS_BASE);
>                 vmx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
>         }

I'm okay with this, but I think you should fix the corresponding bugs
on the restore side as well. The current code is:

if (host_state->ldt_sel || (host_state->gs_sel & 7)) {
        kvm_load_ldt(host_state->ldt_sel);
#ifdef CONFIG_X86_64
        load_gs_index(host_state->gs_sel);
#else
        loadsegment(gs, host_state->gs_sel);
#endif
}
if (host_state->fs_sel & 7)
        loadsegment(fs, host_state->fs_sel);

which is blatantly wrong in the case where fs_sel.TI == 1, gs_sel.TI
== 0, and ldt_sel != 0. But it's also more subtly wrong -- this
corrupts all the segment attributes in the case where a segment points
to the GDT and the GDT attributes are non-default. So I would suggest
making the code radically simpler and more correct:

if (host_state->ldt_sel)
        kvm_load_ldt(host_state->ldt_sel);  // see below
if (host_state->ds_sel)
        loadsegment(ds, host_state->ds_sel);
if (host_state->es_sel)
        loadsegment(es, host_state->es_sel);
if (host_state->fs_sel) {
        loadsegment(fs, host_state->fs_sel);
        x86_fsbase_write_cpu(host_state->fs_base);
}
if (host_state->gs_sel) {
        load_gs_index(host_state->gs_sel);
        x86_gsbase_write_cpu_inactive(vmx->msr_host_kernel_gs_base);
}

In the IMO unlikely event that any performance-critical KVM userspace
runs with these selectors != 0, you could also skip the load if they
are set to __USER_DS.
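
E.g. a minimal sketch for ds (es would be the same), assuming the register
already holds __USER_DS after VM exit:

if (host_state->ds_sel && host_state->ds_sel != __USER_DS)
        loadsegment(ds, host_state->ds_sel);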

You can also simplify this crud:

if (unlikely(fs_sel != host->fs_sel)) {
        if (!(fs_sel & 7))
                vmcs_write16(HOST_FS_SELECTOR, fs_sel);
        else
                vmcs_write16(HOST_FS_SELECTOR, 0);
        host->fs_sel = fs_sel;
}

There is nothing special about segments with TI set according to the
SDM (AFAICT), nor are they treated specially in Linux's segment tables,
so this code makes little sense. It could just be:

host->fs_sel = fs_sel;

and leave the VMCS field set to 0. Or if you do the __USER_DS
optimization above, you could do:

if (unlikely(fs_sel != host->fs_sel)) {
        vmcs_write16(HOST_FS_SELECTOR, fs_sel == __USER_DS ? __USER_DS : 0);
        host->fs_sel = fs_sel;
}

I suspect that the only users who care about performance (or for whom
we should care about performance) are normal userspace, and normal
64-bit userspace has DS == ES == FS == GS == 0.
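
(Easy to sanity-check with a quick hypothetical test program on a 64-bit
box, e.g.:

#include <stdio.h>

int main(void)
{
        unsigned short ds, es, fs, gs;

        /* Dump the current data segment selector values. */
        asm volatile("mov %%ds, %0" : "=r" (ds));
        asm volatile("mov %%es, %0" : "=r" (es));
        asm volatile("mov %%fs, %0" : "=r" (fs));
        asm volatile("mov %%gs, %0" : "=r" (gs));
        printf("ds=%#x es=%#x fs=%#x gs=%#x\n", ds, es, fs, gs);
        return 0;
}

which should print all zeros for an ordinary 64-bit process.)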


I would also be okay with making the KVM code match the context switch
code, but this may be distinctly nontrivial.