Re: [PATCH 4/5] x86: entry_64.S: always allocate complete "struct pt_regs"

From: Denys Vlasenko
Date: Tue Aug 05 2014 - 11:18:29 EST


On 08/05/2014 04:53 PM, Andy Lutomirski wrote:
> On Aug 5, 2014 7:36 PM, "Denys Vlasenko" <vda.linux@xxxxxxxxxxxxxx> wrote:
>>
>> On Mon, Aug 4, 2014 at 11:03 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>> Next up: remove FIXUP/RESTORE_TOP_OF_STACK? :) Maybe I'll give that a shot.
>>>>
>>>> I'm yet at the stage "what that stuff does anyway?" and at
>>>> "why do we need percpu old_rsp thingy?" in particular.
>>>
>>> On x86_64, the syscall instruction has no effect on rsp. That means
>>> that the entry point starts out with no stack. There are no free
>>> registers whatsoever at the entry point.
>>>
>>> That means that the entry code needs to do swapgs, stash rsp somewhere
>>> relative to gs, and then load the kernel's rsp. old_rsp is the spot
>>> used for this.
>>>
>>> Now the kernel does an optimization that is, I think, very much not
>>> worth it. The kernel doesn't bother sticking the old rsp value into
>>> pt_regs (saving two instructions on fast path entries) and doesn't
>>> initialize the SS, CS, RCX, and EFLAGS fields in pt_regs, saving four
>>> more instructions.
>>>
>>> To make this optimization work, the whole FIXUP/RESTORE_TOP_OF_STACK
>>> dance is needed, and there's the usersp crap in the context switch
>>> code, and current_user_stack_pointer(), and probably even more crap
>>> that I haven't noticed. And I sure hope that nothing in the *compat*
>>> syscall path touches current_user_stack_pointer(), because the compat
>>> code doesn't seem to use old_rsp.
>>>
>>> I think this should all be ripped out. The only real difficulty will
>>> be that the sysret code needs to restore rsp itself, so the sysret
>>> path will end up needing two more instructions. Removing all of the
>>> TOP_OF_STACK stuff will add ten instructions to fast path syscalls,
>>> and I wouldn't be surprised if this adds considerably fewer than ten
>>> cycles on any modern chip.
>>
>> Something like this on the fast path? -
>>
>> SWAPGS_UNSAFE_STACK
>> movq %rsp,PER_CPU_VAR(old_rsp)
>> movq PER_CPU_VAR(kernel_stack),%rsp
>> ENABLE_INTERRUPTS(CLBR_NONE)
>> ALLOC_PTREGS_ON_STACK 8 /* +8: space for orig_ax */
>> SAVE_C_REGS
>> movq %rax,ORIG_RAX(%rsp)
>> movq %rcx,RIP(%rsp)
>> + movq %r11,EFLAGS(%rsp)
>> + movq PER_CPU_VAR(old_rsp),%rcx
>> + movq %rcx,RSP(%rsp)
>> ...
>> - RESTORE_C_REGS_EXCEPT_RCX
>> + RESTORE_C_REGS_EXCEPT_RCX_R11
>> movq RIP(%rsp),%rcx
>> + movq EFLAGS(%rsp), %r11
>> - movq PER_CPU_VAR(old_rsp), %rsp
>> + movq RSP(%rsp), %rsp
>> USERGS_SYSRET64
>
> The sysret code still needs the inverse, right?

The part after "..." in my skecth is the sysret code.

> ptrace can change RSP.

By writing to pt_regs->rsp, yes. And the above code
will pick it up - we read RSP(%rsp).

>> Do we need to save rcx and r11 in "struct pt_regs" in their
>> "standard" slots, though?
>
> ptrace probably wants it.

Let's see.
They don't contain any useful information:
With current code,
pt_regs->r11 is the same as pt_regs->rflags,
pt_regs->rcx is the same as pt_regs->rip (modulo weird store of -1).
So reading them by ptrace is... weird - just read
pt_regs->rflags or pt_regs->rip instead!

If ptrace is active, we'll return via iretq.
If ptrace wrote to these pt_regs members, on return
to userspace current code restores modified values.
My proposed change does not change this.

So, only ptrace reads of rcx and r11 will be affected.
Hmm. Maybe we can fill them only on "tracesys:" codepath?

>> Then old_rsp can be nuked everywhere else,
>> RESTORE_TOP_OF_STACK can be nuked, and
>> FIXUP_TOP_OF_STACK can be reduced to merely:
>>
>> movq $__USER_DS,SS(%rsp)
>> movq $__USER_CS,CS(%rsp)
>
> Mmm, right. That's probably better than doing this on the fast path.
>
>>
>> (BTW, why currently it does "movq $-1,RCX+\offset(%rsp)?)
>
> I would argue this is a bug.

Agree.

> (In fact, I have a patch floating around
> to fix it. The current code is glitchy in a visible-to-user-space
> way.) We should put rcx into both RIP and RCX, since the sysret path
> will implicitly do that, and we should restore the same register
> values in the iret and sysret paths.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/