Re: [RFC 0/7] Prep code for better stack switching

From: Andy Lutomirski
Date: Sat Nov 11 2017 - 23:26:00 EST


On Sat, Nov 11, 2017 at 6:59 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> On Sat, Nov 11, 2017 at 2:58 AM, Borislav Petkov <bp@xxxxxxx> wrote:
>> On Fri, Nov 10, 2017 at 08:05:19PM -0800, Andy Lutomirski wrote:
>>> This isn't quite done (the TSS remap patch is busted on 32-bit, but
>>> that's a straightforward fix), but it should be ready for at least a
>>> conceptual review.
>>>
>>> The idea here is to prepare us to have all kernel data needed for
>>> user mode execution and early entry located in the fixmap. To do
>>> this, I hijack the GDT remap mechanism and make it more general. I
>>> add a struct cpu_entry_area. This struct is never instantiated
>>> directly. Instead, it represents the layout of a per-cpu portion of
>>> the fixmap. That portion contains the GDT, the TSS (including IO
>>> bitmap), and the entry stack (for now just a part of the TSS
>>> region). It should also end up containing the PEBS and BTS buffers.
>>>
>>> If this works, then the idea would be to add a magic *executable* page
>>> to cpu_entry_area. That page would contain a stub like this:
>>>
>>> ENTRY(entry_SYSCALL_64_trampoline)
>>>     UNWIND_HINT_EMPTY
>>>     movq %rsp, 0x1000+entry_SYSCALL_64_trampoline-1f(%rip)
>>> 1:
>>>     movq 0x1008+entry_SYSCALL_64_trampoline-1f(%rip), %rsp
>>> 1:
>>>     pushq %rdi
>>>     pushq %rsi
>>
>>>     movq 0x1000+entry_SYSCALL_64_trampoline-1f(%rip), %rsi
>>> 1:
>>>     movq $entry_SYSCALL_64, %rdi
>>>     jmp *%rdi
>>
>> So I'm wondering: r12-r15 are callee-preserved so why can't you
>> scratch into those on entry and leave rsi and rdi pristine so that
>> entry_SYSCALL_64 can get to work directly?
>
> I'm not sure I understand your suggestion. SYSCALL has always
> preserved all regs except rcx, r11, flags, rax, and, depending on what
> signals are involved, the argument registers. r12-r15 are definitely
> preserved, and existing userspace relies on that.
>
> Anyway, I'm halfway through actually implementing this, and it looks a
> wee bit different, but not much different.


Here it is:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_stack.wip&id=96a6ab74088a86f6b9b6df8284c6466e4fa50d08

Seems to work for me.
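For anyone puzzling over the operands in the quoted stub: an expression
like 0x1000+entry_SYSCALL_64_trampoline-1f(%rip) resolves to a fixed slot
one page past the start of the trampoline, no matter where the
instruction sits, because %rip holds the address of the next instruction
(the 1: label) when the operand is evaluated. The C model below is only a
sketch of that arithmetic. The function names and the example address are
hypothetical, and reading 0x1000 as the saved user %rsp slot and 0x1008
as the kernel stack pointer slot is an inference from the stub, not
something taken from the linked commit.

```c
#include <assert.h>
#include <stdint.h>

/* Displacement the assembler would emit for `slot + entry - 1f`.
 * All three inputs are plain addresses/offsets in this model. */
int64_t trampoline_disp(uint64_t entry, uint64_t label_1f, uint64_t slot)
{
    return (int64_t)(slot + entry - label_1f);
}

/* Effective address of `disp(%rip)`: the displacement is added to the
 * address of the *next* instruction, which is where the 1: label sits. */
uint64_t rip_relative_target(uint64_t next_insn_addr, int64_t disp)
{
    return next_insn_addr + (uint64_t)disp;
}

/* Hypothetical self-check: wherever the label lands inside the stub,
 * the operand always resolves to entry + slot. */
void trampoline_self_check(void)
{
    const uint64_t entry = 0xffffffffff5fd000ull; /* made-up address */

    for (uint64_t off = 1; off < 64; off++) {
        uint64_t label = entry + off;
        assert(rip_relative_target(label,
               trampoline_disp(entry, label, 0x1000)) == entry + 0x1000);
        assert(rip_relative_target(label,
               trampoline_disp(entry, label, 0x1008)) == entry + 0x1008);
    }
}
```

This is why each RIP-relative instruction in the stub is followed by its
own 1: label: the label marks the value %rip has for that instruction, so
the label address cancels out of the effective address.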

Dave, want to see if you can get this working cleanly without mapping
any percpu variables at all? You'll probably have to move PEBS, etc.
into cpu_entry_area. For now, it should be safe to just ignore the
LDT. I'm somewhat tempted to just adjust your code so that the fixmap
ends up being mapped separately for LDT-using tasks rather than
mucking with putting the LDT in the user address range. The latter
involves a little more mm magic than I really want to deal with if I
can avoid it.
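As a rough illustration of the "struct that is never instantiated" idea
from the cover letter quoted above, the sketch below uses a struct purely
as a layout description and derives per-CPU addresses from it. Everything
here is an assumption made for the example: the _sketch names, the
three-page size of the TSS region, the base address, and the ascending
per-CPU ordering. None of it is taken from the actual patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_TSS_PAGES 3u   /* hypothetical size, not from the series */

/* Layout only: nothing ever declares a variable of this type. */
struct cpu_entry_area_sketch {
    uint8_t gdt[SKETCH_PAGE_SIZE];                    /* GDT page */
    uint8_t tss[SKETCH_TSS_PAGES * SKETCH_PAGE_SIZE]; /* TSS, IO bitmap,
                                                         entry stack */
};

/* The area must be a whole number of pages so it can be mapped page by
 * page into the fixmap. */
static_assert(sizeof(struct cpu_entry_area_sketch) % SKETCH_PAGE_SIZE == 0,
              "cpu_entry_area_sketch must be page-sized");

/* Number of fixmap pages one CPU's area would occupy. */
size_t cea_pages(void)
{
    return sizeof(struct cpu_entry_area_sketch) / SKETCH_PAGE_SIZE;
}

/* Address of a field for a given CPU, assuming the per-CPU areas are
 * laid out back to back starting at `base`. */
uintptr_t cea_field_addr(uintptr_t base, unsigned int cpu, size_t field_offset)
{
    return base
         + (uintptr_t)cpu * sizeof(struct cpu_entry_area_sketch)
         + field_offset;
}
```

For instance, cea_field_addr(base, 2, offsetof(struct
cpu_entry_area_sketch, tss)) gives the address of CPU 2's TSS region
under these assumptions. Moving PEBS or BTS buffers into the area would
then amount to adding members to the struct.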