Re: [RFC][PATCH v2 00/21] x86/pti: Defer CR3 switch to C code

From: Alexandre Chartre
Date: Tue Nov 17 2020 - 03:17:07 EST



On 11/16/20 9:24 PM, Borislav Petkov wrote:
> On Mon, Nov 16, 2020 at 03:47:36PM +0100, Alexandre Chartre wrote:
> > Deferring CR3 switch to C code means that we need to run more of the
> > kernel entry code with the user page-table. To do so, we need to:
> >
> > - map more syscall, interrupt and exception entry code into the user
> > page-table (map all noinstr code);
> >
> > - map additional data used in the entry code (such as stack canary);
> >
> > - run more entry code on the trampoline stack (which is mapped both
> > in the kernel and in the user page-table) until we switch to the
> > kernel page-table and then switch to the kernel stack;
>
> So PTI was added exactly to *not* have kernel memory mapped in the user
> page table. You're partially reversing that...

We are not reversing PTI, we are extending it.

PTI removes all kernel mappings from the user page-table. However, there's
no issue with mapping some kernel data into the user page-table as long as
that data contains no sensitive information.

Actually, PTI is already doing that, but with a very limited scope. PTI adds
to the user page-table some kernel mappings that are needed for userland
to enter the kernel (such as the kernel entry text, the ESPFIX, the
CPU_ENTRY_AREA_BASE...).

So here, we are extending the PTI mapping so that we can execute more kernel
code while using the user page-table; it's a kind of PTI on steroids.
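
To make the distinction concrete, here is a toy model (not code from the
series) of which regions end up in the user page-table with PTI today and
with this RFC. The region names are the ones mentioned in this thread; the
table, the enum and the user_mapped() helper are hypothetical and exist only
for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Which scheme first maps a region into the user page-table. */
enum mapped_by { MAPPED_BY_PTI, MAPPED_BY_RFC };

struct user_mapping {
	const char *region;
	enum mapped_by since;
};

/* Region names come from this thread; the table itself is illustrative. */
static const struct user_mapping user_mappings[] = {
	{ "kernel entry text",   MAPPED_BY_PTI },
	{ "ESPFIX",              MAPPED_BY_PTI },
	{ "CPU_ENTRY_AREA_BASE", MAPPED_BY_PTI },
	{ "noinstr code",        MAPPED_BY_RFC },
	{ "stack canary",        MAPPED_BY_RFC },
	{ "PTI stack",           MAPPED_BY_RFC },
};

/* Return true if the named region is visible in the user page-table. */
static bool user_mapped(const char *region, bool with_rfc)
{
	size_t n = sizeof(user_mappings) / sizeof(user_mappings[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(user_mappings[i].region, region) != 0)
			continue;
		return with_rfc || user_mappings[i].since == MAPPED_BY_PTI;
	}
	return false;	/* anything not listed stays kernel-only */
}
```

In this model, user_mapped("ESPFIX", false) is true while
user_mapped("noinstr code", false) is false; with the RFC applied, both are
true, and anything not in the table remains unmapped in either case.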


> > - have a per-task trampoline stack instead of a per-cpu trampoline
> > stack, so the task can be scheduled out while it hasn't switched
> > to the kernel stack.
>
> per-task? How much more memory is that per task?


Currently, this is done by doubling the size of the task stack (patch 8),
so that's an extra 8KB. Half of the stack is used as the regular kernel
stack, and the other half is used as the PTI stack:

+/*
+ * PTI doubles the size of the stack. The entire stack is mapped into
+ * the kernel address space. However, only the top half of the stack is
+ * mapped into the user address space.
+ *
+ * On syscall or interrupt, user mode enters the kernel with the user
+ * page-table, and the stack pointer is switched to the top of the
+ * stack (which is mapped in the user address space and in the kernel).
+ * The syscall/interrupt handler will then later decide when to switch
+ * to the kernel address space, and to switch to the top of the kernel
+ * stack which is only mapped in the kernel.
+ *
+ *  +-------------+
+ *  |             | ^                       ^
+ *  | kernel-only | | KERNEL_STACK_SIZE     |
+ *  |    stack    | |                       |
+ *  |             | V                       |
+ *  +-------------+ <- top of kernel stack  | THREAD_SIZE
+ *  |             | ^                       |
+ *  | kernel and  | | KERNEL_STACK_SIZE     |
+ *  | PTI stack   | |                       |
+ *  |             | V                       v
+ *  +-------------+ <- top of stack
+ */
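
As a rough sketch of the layout described in that comment (not code from the
patch), the two stack tops can be derived from the lowest address of the
stack allocation. The 8KB value for KERNEL_STACK_SIZE is taken from the
"extra 8KB" figure above; the struct and the two helper functions are
hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: 8KB per half, matching the "extra 8KB" figure above. */
#define KERNEL_STACK_SIZE	(8UL * 1024)
#define THREAD_SIZE		(2 * KERNEL_STACK_SIZE)

/* Hypothetical helper type: the two stack tops for one task. */
struct task_stack_tops {
	uintptr_t kernel_top;	/* top of the kernel-only half */
	uintptr_t pti_top;	/* top of the whole stack (PTI half) */
};

/* base is the lowest address of the THREAD_SIZE allocation. */
static struct task_stack_tops stack_tops(uintptr_t base)
{
	struct task_stack_tops tops;

	assert(base % KERNEL_STACK_SIZE == 0);	/* sketch assumes an aligned base */
	tops.kernel_top = base + KERNEL_STACK_SIZE;
	tops.pti_top = base + THREAD_SIZE;
	return tops;
}

/* Only the upper half [kernel_top, pti_top) is mapped in the user page-table. */
static bool mapped_in_user_pgtable(uintptr_t base, uintptr_t addr)
{
	return addr >= base + KERNEL_STACK_SIZE && addr < base + THREAD_SIZE;
}
```

For example, with base = 0x10000 this gives kernel_top = 0x12000 and
pti_top = 0x14000, and only addresses in [0x12000, 0x14000) are visible with
the user page-table.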

The minimum size would be 1 page (4KB) as this is the minimum mapping size.
It's certainly enough for now as the usage of the PTI stack is limited, but
we will need a larger stack if we want to execute more kernel code with the
user page-table.
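
To put numbers on the per-task cost, here is a small back-of-the-envelope
sketch (again, not code from the series). It rounds a requested PTI stack
size up to whole pages, since a page is the minimum mapping size, and
multiplies by the number of tasks. PAGE_SIZE is assumed to be 4KB as stated
above; both helper names are hypothetical.

```c
#include <stddef.h>

#define PAGE_SIZE	4096UL	/* 4KB, the minimum mapping size */

/* Round a requested PTI stack size up to a whole number of pages. */
static size_t pti_stack_bytes(size_t requested)
{
	if (requested == 0)
		return PAGE_SIZE;	/* at least one page */
	return (requested + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
}

/* Extra memory for nr_tasks tasks, each with its own PTI stack. */
static size_t pti_stack_overhead(size_t nr_tasks, size_t requested)
{
	return nr_tasks * pti_stack_bytes(requested);
}
```

With 1000 tasks, the current 8KB PTI stack costs 8,192,000 bytes (about
7.8MB), while a single-page PTI stack would cost half of that.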

alex.