Re: [PATCH 02/16] x86/entry/32: Enter the kernel via trampoline stack

From: Joerg Roedel
Date: Wed Jan 17 2018 - 04:19:03 EST


On Tue, Jan 16, 2018 at 02:45:27PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 16, 2018 at 8:36 AM, Joerg Roedel <joro@xxxxxxxxxx> wrote:
> > +.macro SWITCH_TO_KERNEL_STACK nr_regs=0 check_user=0
>
> How about marking nr_regs with :req to force everyone to be explicit?

Yeah, that's more readable, I'll change it.

> > + /*
> > + * TSS_sysenter_stack is the offset from the bottom of the
> > + * entry-stack
> > + */
> > + movl TSS_sysenter_stack + ((\nr_regs + 1) * 4)(%esp), %esp
>
> This is incomprehensible. You're adding what appears to be the offset
> of sysenter_stack within the TSS to something based on esp and
> dereferencing that to get the new esp. That's not actually what
> you're doing, but please change asm_offsets.c (as in my previous
> email) to avoid putting serious arithmetic in it and then do the
> arithmetic right here so that it's possible to follow what's going on.

This probably needs better comments. TSS_sysenter_stack is the offset
to tss.sp0 (tss.sp1 later) from the _bottom_ of the entry stack. But in
this macro the stack might not be empty: it holds a configurable number
of dwords (given by \nr_regs). Before this instruction we also do a
push %edi, so we need (\nr_regs + 1).

This can't be put into asm_offsets.c, as the actual offset depends on
how much is on the stack.
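
To illustrate the arithmetic, here is a minimal sketch in C. It is a
model only, not code from the patch: the function and parameter names
are made up for illustration, and sp0_off_from_bottom merely stands in
for TSS_sysenter_stack. It assumes "bottom" means the address %esp holds
when the entry stack is empty.

```c
#include <stdint.h>

enum { DWORD = 4 };	/* size of one stack slot on 32-bit x86 */

/*
 * Hypothetical helper: byte offset to add to the current %esp in order
 * to reach the slot holding the kernel stack pointer.
 *
 * sp0_off_from_bottom - distance from the bottom of the entry stack to
 *                       that slot (the role TSS_sysenter_stack plays)
 * nr_regs             - dwords already on the entry stack at entry
 *
 * The "+ 1" accounts for the extra "push %edi" done before the load.
 * Arithmetic wraps modulo 2^32, like address arithmetic on the CPU.
 */
static inline uint32_t sp0_offset_from_esp(uint32_t sp0_off_from_bottom,
					   uint32_t nr_regs)
{
	return sp0_off_from_bottom + (nr_regs + 1) * DWORD;
}

/*
 * Hypothetical helper: value of %esp after nr_regs dwords plus the
 * saved %edi have been pushed onto an initially empty stack.
 */
static inline uint32_t esp_after_pushes(uint32_t stack_bottom,
					uint32_t nr_regs)
{
	return stack_bottom - (nr_regs + 1) * DWORD;
}
```

Whatever nr_regs is, esp_after_pushes() + sp0_offset_from_esp() gives
stack_bottom + sp0_off_from_bottom, which is why the constant part can
live in asm_offsets.c while the (\nr_regs + 1) * 4 part has to be added
at the call site.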

> > ENTRY(entry_INT80_32)
> > ASM_CLAC
> > pushl %eax /* pt_regs->orig_ax */
> > +
> > + /* Stack layout: ss, esp, eflags, cs, eip, orig_eax */
> > + SWITCH_TO_KERNEL_STACK nr_regs=6 check_user=1
> > +
>
> Why check_user?

You are right, check_user shouldn't be needed as INT80 is never called
from kernel mode.

> > ENTRY(nmi)
> > ASM_CLAC
> > +
> > + /* Stack layout: ss, esp, eflags, cs, eip */
> > + SWITCH_TO_KERNEL_STACK nr_regs=5 check_user=1
>
> This is wrong, I think. If you get an nmi in kernel mode but while
> still on the sysenter stack, you blow up. IIRC we have some crazy
> code already to handle this (for nmi and #DB), and maybe that's
> already adequate or can be made adequate, but at the very least this
> needs a big comment explaining why it's okay.

If we get an nmi while still on the sysenter stack, then we are not
entering the handler from user-space and the above code will do
nothing and behave as before.

But you are right, it might blow up. There is a problem with the cr3
switch, because the nmi can happen in kernel mode before the cr3 is
switched, then this handler will not do the cr3 switch itself and crash
the kernel. But the stack switching should be fine, I think.

> > + /*
> > + * TODO: Find a way to let cpu_current_top_of_stack point to
> > + * cpu_tss_rw.x86_tss.sp1. Doing so now results in stack corruption with
> > + * iret exceptions.
> > + */
> > + this_cpu_write(cpu_tss_rw.x86_tss.sp1, next_p->thread.sp0);
>
> Do you know what the issue is?

No, not yet, I will look into that again. But first I want to get
this series stable enough as it is.

> As a general comment, the interaction between this patch and vm86 is a
> bit scary. In vm86 mode, the kernel gets entered with extra stuff on
> the stack, which may screw up all your offsets.

I just read up on vm86-mode control transfers and the resulting stack layout.
Looks like I need to check for eflags.vm=1 and copy four more registers
from/to the entry stack. Thanks for pointing that out.
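
As a rough sketch of that check (an illustration only, not code from the
series; hw_frame_dwords is a made-up name): on an entry from vm86 mode
the CPU pushes gs, fs, ds and es in addition to ss, esp, eflags, cs and
eip, so the hardware frame grows from five to nine dwords. This assumes
the entry came from user space, so that ss and esp were pushed at all.

```c
#include <stdint.h>

#define X86_EFLAGS_VM 0x00020000u	/* EFLAGS.VM, bit 17 */

/*
 * Hypothetical helper: number of dwords the CPU pushed onto the entry
 * stack for an interrupt/exception taken from user space.
 *
 *   normal user mode: ss, esp, eflags, cs, eip             -> 5
 *   vm86 mode:        gs, fs, ds, es + the five above      -> 9
 */
static inline unsigned int hw_frame_dwords(uint32_t saved_eflags)
{
	return (saved_eflags & X86_EFLAGS_VM) ? 9 : 5;
}
```

The result would then feed into the number of dwords to copy from/to
the entry stack, on top of anything the entry code pushed itself (such
as orig_eax).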

Thanks,

Joerg