[RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

From: Alexandre Chartre
Date: Mon Nov 16 2020 - 09:47:11 EST


Version 2 addressing comments from Andy:

- paranoid_entry/exit is back to assembly code. This avoids having
a C version of SWAPGS and the need to disable stack-protector.
(remove patches 8, 9, 21 from v1).

- SAVE_AND_SWITCH_TO_KERNEL_CR3 and RESTORE_CR3 are removed from
paranoid_entry/exit and moved to C (patch 19).

- __per_cpu_offset is mapped into the user page-table (patch 11)
so that paranoid_entry can update GS before CR3 is switched.

- use a different stack canary with the user and kernel page-tables.
This is a new patch in v2, added to avoid leaking the kernel stack
canary into the user page-table (patch 21).

Patches are now based on v5.10-rc4.

----

With Page Table Isolation (PTI), syscalls as well as interrupts and
exceptions occurring in userspace enter the kernel with a user
page-table. The kernel entry code will then switch the page-table
from the user page-table to the kernel page-table by updating the
CR3 control register. This CR3 switch is currently done early in
the kernel entry sequence using assembly code.
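
As a rough illustration of what the CR3 switch amounts to, here is a
minimal user-space sketch. It assumes the layout used by mainline PTI
when PCID is in use: the user PGD sits in the page right after the
kernel PGD (bit 12 of CR3) and the user PCID is selected with bit 11.
The sketch_* names and SKETCH_* constants are hypothetical and only
model the bit manipulation; this is not the code from the patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, modelled on the mainline PTI layout. */
#define SKETCH_PAGE_SHIFT		12
#define SKETCH_USER_PGTABLE_BIT		SKETCH_PAGE_SHIFT	/* user PGD = kernel PGD + one page */
#define SKETCH_USER_PCID_BIT		11			/* user half of the PCID space */

#define SKETCH_USER_PGTABLE_MASK	(UINT64_C(1) << SKETCH_USER_PGTABLE_BIT)
#define SKETCH_USER_PCID_MASK		(UINT64_C(1) << SKETCH_USER_PCID_BIT)
#define SKETCH_USER_MASK		(SKETCH_USER_PGTABLE_MASK | SKETCH_USER_PCID_MASK)

/* Derive the user CR3 value from a kernel CR3 value by setting both bits. */
static inline uint64_t sketch_user_cr3(uint64_t cr3)
{
	return cr3 | SKETCH_USER_MASK;
}

/* Derive the kernel CR3 value from a user CR3 value by clearing both bits. */
static inline uint64_t sketch_kernel_cr3(uint64_t cr3)
{
	return cr3 & ~SKETCH_USER_MASK;
}

/* A CR3 value designates the user page-table when the page-table bit is set. */
static inline bool sketch_is_user_cr3(uint64_t cr3)
{
	return (cr3 & SKETCH_USER_PGTABLE_MASK) != 0;
}
```

In this model, switching page-tables is just writing the value returned
by one of the two helpers to CR3; the real entry code additionally has
to deal with TLB flushing and with CPUs that do not support PCID.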

This RFC proposes to defer the PTI CR3 switch until we reach C code.
The benefit is that this simplifies the assembly entry code and makes
the PTI CR3 switch code easier to understand. This also paves the way
for possible further projects, such as an easier integration of Address
Space Isolation (ASI), or the possibility to execute some selected
syscall or interrupt handlers without switching to the kernel page-table
(and thus avoid the PTI page-table switch overhead).

Deferring CR3 switch to C code means that we need to run more of the
kernel entry code with the user page-table. To do so, we need to:

- map more syscall, interrupt and exception entry code into the user
page-table (map all noinstr code);

- map additional data used in the entry code (such as stack canary);

- run more entry code on the trampoline stack (which is mapped both
in the kernel and in the user page-table) until we switch to the
kernel page-table and then switch to the kernel stack;

- have a per-task trampoline stack instead of a per-cpu trampoline
stack, so the task can be scheduled out while it hasn't switched
to the kernel stack.

Note that, for now, the CR3 switch can only be deferred for as long as
interrupts remain disabled in the entry code. This is because the decision
to switch CR3 is based on the privilege level in the CS register saved in
the interrupt frame. I plan to fix this, but that adds some complication
(we need to track whether the user page-table is in use or not).
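
The limitation can be pictured with the small user-space sketch below.
It is only a model under stated assumptions: the frame structure, the
sketch_* helpers and the user_pgtable_loaded flag are hypothetical and
do not come from the patches. The only hard fact it relies on is that
the low two bits of CS hold the privilege level (3 for userspace, 0 for
the kernel).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the hardware interrupt frame. */
struct sketch_iret_frame {
	uint64_t ip;
	uint64_t cs;
	uint64_t flags;
	uint64_t sp;
	uint64_t ss;
};

/* The low two bits of CS hold the privilege level: 3 = user, 0 = kernel. */
static inline bool sketch_entered_from_user(const struct sketch_iret_frame *f)
{
	return (f->cs & 3) == 3;
}

/*
 * CS-based decision: switch to the kernel page-table only when the
 * interrupted context was userspace.
 */
static inline bool sketch_cs_wants_cr3_switch(const struct sketch_iret_frame *f)
{
	return sketch_entered_from_user(f);
}

/*
 * Returns true when the CS-based decision matches what is actually
 * needed, given whether the user page-table is currently loaded
 * (hypothetical state that the current code does not track).
 */
static inline bool sketch_cs_decision_is_correct(const struct sketch_iret_frame *f,
						 bool user_pgtable_loaded)
{
	return sketch_cs_wants_cr3_switch(f) == user_pgtable_loaded;
}
```

With interrupts disabled, an entry from userspace always runs on the
user page-table and the CS check gives the right answer. If interrupts
were enabled before the deferred switch, a nested interrupt would see a
kernel CS while the user page-table is still loaded, which is the case
where sketch_cs_decision_is_correct() returns false.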

The proposed patchset is in RFC state to get early feedback about this
proposal.

The code survives running a kernel build and LTP. Note that changes are
only for 64-bit at the moment. I haven't looked at 32-bit yet, but I will
definitely check it.

Thanks,

alex.

-----

Alexandre Chartre (21):
x86/syscall: Add wrapper for invoking syscall function
x86/entry: Update asm_call_on_stack to support more function arguments
x86/entry: Consolidate IST entry from userspace
x86/sev-es: Define a setup stack function for the VC idtentry
x86/entry: Implement ret_from_fork body with C code
x86/pti: Provide C variants of PTI switch CR3 macros
x86/entry: Fill ESPFIX stack using C code
x86/pti: Introduce per-task PTI trampoline stack
x86/pti: Function to clone page-table entries from a specified mm
x86/pti: Function to map per-cpu page-table entry
x86/pti: Extend PTI user mappings
x86/pti: Use PTI stack instead of trampoline stack
x86/pti: Execute syscall functions on the kernel stack
x86/pti: Execute IDT handlers on the kernel stack
x86/pti: Execute IDT handlers with error code on the kernel stack
x86/pti: Execute system vector handlers on the kernel stack
x86/pti: Execute page fault handler on the kernel stack
x86/pti: Execute NMI handler on the kernel stack
x86/pti: Defer CR3 switch to C code for IST entries
x86/pti: Defer CR3 switch to C code for non-IST and syscall entries
x86/pti: Use a different stack canary with the user and kernel
page-table

arch/x86/entry/common.c | 58 ++++-
arch/x86/entry/entry_64.S | 346 +++++++++++---------------
arch/x86/entry/entry_64_compat.S | 22 --
arch/x86/include/asm/entry-common.h | 194 +++++++++++++++
arch/x86/include/asm/idtentry.h | 130 +++++++++-
arch/x86/include/asm/irq_stack.h | 11 +
arch/x86/include/asm/page_64_types.h | 36 ++-
arch/x86/include/asm/processor.h | 3 +
arch/x86/include/asm/pti.h | 18 ++
arch/x86/include/asm/stackprotector.h | 35 ++-
arch/x86/include/asm/switch_to.h | 7 +-
arch/x86/include/asm/traps.h | 2 +-
arch/x86/kernel/cpu/mce/core.c | 7 +-
arch/x86/kernel/espfix_64.c | 41 +++
arch/x86/kernel/nmi.c | 34 ++-
arch/x86/kernel/sev-es.c | 63 +++++
arch/x86/kernel/traps.c | 61 +++--
arch/x86/mm/fault.c | 11 +-
arch/x86/mm/pti.c | 76 ++++--
include/linux/sched.h | 8 +
kernel/fork.c | 25 ++
21 files changed, 874 insertions(+), 314 deletions(-)

--
2.18.4