[RFC 0/7] Prep code for better stack switching

From: Andy Lutomirski
Date: Fri Nov 10 2017 - 23:06:07 EST


This isn't quite done (the TSS remap patch is busted on 32-bit, but
that's a straightforward fix), but it should be ready for at least a
conceptual review.

The idea here is to prepare us to have all kernel data needed for
user mode execution and early entry located in the fixmap. To do
this, I hijack the GDT remap mechanism and make it more general. I
add a struct cpu_entry_area. This struct is never instantiated
directly. Instead, it represents the layout of a per-cpu portion of
the fixmap. That portion contains the GDT, the TSS (including IO
bitmap), and the entry stack (for now just a part of the TSS
region). It should also end up containing the PEBS and BTS buffers.
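
Concretely, I'm imagining something along these lines. This is just a
sketch: the field names and exact contents are illustrative, not what
the patches literally add.

struct cpu_entry_area {
	/* The remapped GDT for this CPU. */
	char gdt[PAGE_SIZE];

	/*
	 * The TSS, including the IO bitmap.  For now the entry
	 * (SYSENTER) stack is just a region inside the TSS.
	 */
	struct tss_struct tss;

	/*
	 * Eventually: an executable trampoline page plus the PEBS
	 * and BTS buffers.
	 */
};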

If this works, then the idea would be to add a magic *executable* page
to cpu_entry_area. That page would contain a stub like this:

ENTRY(entry_SYSCALL_64_trampoline)
	UNWIND_HINT_EMPTY
	/* Stash the user RSP in the scratch slot (top word of the entry stack). */
	movq	%rsp, 0x1000+entry_SYSCALL_64_trampoline-1f(%rip)
1:
	/* Load the real task stack pointer. */
	movq	0x1008+entry_SYSCALL_64_trampoline-1f(%rip), %rsp
1:
	/* Save user RDI and RSI on the task stack... */
	pushq	%rdi
	pushq	%rsi
	/* ...and recover the stashed user RSP into RSI. */
	movq	0x1000+entry_SYSCALL_64_trampoline-1f(%rip), %rsi
1:
	/* Jump to the real entry point via a register. */
	movq	$entry_SYSCALL_64, %rdi
	jmp	*%rdi
END(entry_SYSCALL_64_trampoline)

(Those offsets are made up. In real life, they'd be computed using
asm-offsets so they refer to the top word of the entry stack and to
the word that contains the real kernel stack address, respectively.)
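
For example, the offsets could come from additions like these to
asm-offsets.c. The symbol and member names here are hypothetical and
depend on how cpu_entry_area ends up laid out:

	/* Base of the entry (SYSENTER) stack; its top word is the RSP scratch slot. */
	OFFSET(CPU_ENTRY_AREA_SYSENTER_stack, cpu_entry_area, tss.SYSENTER_stack);

	/* Word that holds the real task stack pointer. */
	OFFSET(CPU_ENTRY_AREA_sp0, cpu_entry_area, tss.x86_tss.sp0);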

We'd now enter entry_SYSCALL_64 (probably renamed) on the real task
stack, with user RDI and RSI on that stack (and in need of popping)
and with user RSP in RSI. This is weird, but it gives us some major
benefits:

- This entire sequence works without any %gs prefixes and without
touching the conventional percpu mappings. This means that it
will work without mapping any conventional percpu data. That
removes a considerable amount of complexity in Dave's series and
also closes a giant kASLR hole: Dave's series, as is, leaks the
location of all the percpu mappings.

- We run the SYSCALL entry code in a context in which it has
easy access to scratch space for its CR3 shenanigans.

- I've carefully done this without needing access to the
cpu_entry_area from the post-trampoline entry code. Finding
it would require awkward calculations, a percpu load from
an otherwise unneeded cacheline, or a potentially unfortunate
load of the value we just stored from a different VA alias. I
imagine that the last one is nasty from a microarchitectural
perspective.

I'd really like to do this in a way that makes it optional so that,
if KAISER is disabled, we don't take the TLB miss overhead, which
probably outweighs the minor speedup from no longer stalling on
SWAPGS. OTOH, it might end up benchmarking faster than the current
code: while it's harder on I$ and the TLB, it's easier on D$ (it
avoids two conventional percpu accesses, instead using a cacheline
that's needed anyway for the stack).
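
One possible shape for the optional part, as a sketch only: pick the
SYSCALL target when programming MSR_LSTAR in syscall_init(). Here
"kaiser_enabled" and cpu_entry_area_trampoline() are stand-ins for
whatever enable mechanism and fixmap-alias helper we end up with.

	/* Sketch: choose the SYSCALL entry point at boot. */
	if (kaiser_enabled) {
		/* Hypothetical helper: this CPU's fixmap alias of the stub. */
		wrmsrl(MSR_LSTAR,
		       (unsigned long)cpu_entry_area_trampoline(smp_processor_id()));
	} else {
		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
	}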

The same exact treatment is used for SYSCALL32.

If I didn't forget some detail, this would allow KAISER to function
with only the fixmap, the entry text, and the espfix64 junk mapped.
Down the road, we could further tweak it to get rid of the entry
text too by moving all the CR3-switching code into the fixmap.

The ORC unwinder would need to learn about this special case to be
able to unwind an NMI that hits in the trampoline. Or maybe we
don't care. kallsyms might also want some hackery to recognize
the trampoline for perf's benefit.

Open questions:

- Should the entry stack be anywhere near as big as I made it here?
If I keep it very small, then inappropriate uses of it would be
immediately detected as (properly backtraced) double faults.

- Something should IMO complain very loudly, at least with debugging on,
if we accidentally schedule from the entry stack. As is, it causes
huge corruption but doesn't immediately die. (A sketch of the kind of
check I mean follows this list.)

- This is incompatible with the PIE effort. We'd have to use movabs
instead of movq, but I don't know whether the tooling can handle
the resulting relocation.
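
For the second point, the kind of check I have in mind looks roughly
like this. It's a sketch: on_entry_stack() is a hypothetical helper
that tests whether an address falls inside this CPU's entry stack in
the fixmap, and the check would be called from the scheduler's debug
path or the context switch code.

static inline void debug_check_not_on_entry_stack(void)
{
	unsigned long sp;

	/* Grab the current stack pointer. */
	asm volatile("mov %%rsp, %0" : "=r" (sp));

	if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
		WARN_ON_ONCE(on_entry_stack(sp));
}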

Andy Lutomirski (7):
x86/asm/64: Allocate and enable the SYSENTER stack
x86/gdt: Put per-cpu GDT remaps in ascending order
x86/fixmap: Generalize the GDT fixmap mechanism
x86/asm: Fix assumptions that the HW TSS is at the beginning of
cpu_tss
x86/asm: Rearrange struct cpu_tss to enlarge SYSENTER_stack and fix
alignment
x86/asm: Remap the TSS into the cpu entry area
x86/unwind/64: Add support for the SYSENTER stack

arch/x86/entry/entry_64_compat.S | 2 +-
arch/x86/include/asm/desc.h | 11 ++--------
arch/x86/include/asm/fixmap.h | 43 +++++++++++++++++++++++++++++++++++--
arch/x86/include/asm/processor.h | 25 +++++++++++-----------
arch/x86/include/asm/stacktrace.h | 1 +
arch/x86/kernel/asm-offsets.c | 5 +++++
arch/x86/kernel/asm-offsets_32.c | 5 -----
arch/x86/kernel/cpu/common.c | 45 +++++++++++++++++++++++++++++----------
arch/x86/kernel/doublefault.c | 36 +++++++++++++++----------------
arch/x86/kernel/dumpstack_32.c | 3 +++
arch/x86/kernel/dumpstack_64.c | 23 ++++++++++++++++++++
arch/x86/kernel/process.c | 2 --
arch/x86/kernel/traps.c | 3 +--
arch/x86/power/cpu.c | 16 ++++++++------
arch/x86/xen/mmu_pv.c | 2 +-
15 files changed, 151 insertions(+), 71 deletions(-)

--
2.13.6