[PATCH] x86/doc: add PTI description

From: Dave Hansen
Date: Mon Dec 18 2017 - 17:04:59 EST



This got kicked out of the PTI set as the implementation diverged
from its contents. I've updated it so it can hopefully rejoin the
set.

---

From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>

Add some details about how PTI works, what some of the downsides
are, and how to debug it when things go wrong.

Also document the kernel parameter: 'nopti'.

Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Moritz Lipp <moritz.lipp@xxxxxxxxxxxxxx>
Cc: Daniel Gruss <daniel.gruss@xxxxxxxxxxxxxx>
Cc: Michael Schwarz <michael.schwarz@xxxxxxxxxxxxxx>
Cc: Richard Fellner <richard.fellner@xxxxxxxxxxxxxxxxx>
Cc: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Kees Cook <keescook@xxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: x86@xxxxxxxxxx
---

b/Documentation/admin-guide/kernel-parameters.txt | 4
b/Documentation/x86/pti.txt | 182 ++++++++++++++++++++++
2 files changed, 186 insertions(+)

diff -puN Documentation/admin-guide/kernel-parameters.txt~kpti-doc Documentation/admin-guide/kernel-parameters.txt
--- a/Documentation/admin-guide/kernel-parameters.txt~kpti-doc 2017-12-18 13:55:59.635504663 -0800
+++ b/Documentation/admin-guide/kernel-parameters.txt 2017-12-18 13:55:59.640504663 -0800
@@ -904,6 +904,10 @@
nopku [X86] Disable Memory Protection Keys CPU feature found
in some Intel CPUs.

+ nopti [X86] Disable Page Table Isolation. Disabling this
+ feature removes hardening, but improves performance
+ of system calls and interrupts.
+
module.async_probe [KNL]
Enable asynchronous probe on this module.

diff -puN /dev/null Documentation/x86/pti.txt
--- /dev/null 2017-12-15 13:48:30.454245127 -0800
+++ b/Documentation/x86/pti.txt 2017-12-18 13:57:29.433504439 -0800
@@ -0,0 +1,182 @@
+Overview
+========
+
+Page Table Isolation (pti, previously known as KAISER[1]) is a
+countermeasure against attacks on kernel address information. There
+are at least three existing, published, approaches using the shared
+user/kernel mapping and hardware features to defeat KASLR. One
+approach referenced in a paper[1] locates the kernel by observing
+differences in page fault timing between present-but-inaccessable
+kernel pages and non-present pages.
+
+To avoid leaking address information, we create an new, independent
+copy of the page tables which are used only when running userspace
+applications. When the kernel is entered via syscalls, interrupts or
+exceptions, page tables are switched to the full "kernel" copy. When
+the system switches back to user mode, the user copy is used again.
+
+The userspace page tables contain only a minimal amount of kernel
+data: only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor table
+(IDT). There are a few unnecessary things that get mapped such as the
+first C function when entering an interrupt (see comments in pti.c).
+
+This approach helps to ensure that side-channel attacks that leverage
+the paging structures do not function when PTI is enabled. It can be
+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+Once enabled at compile-time, it can be disabled at boot with the
+'nopti' kernel parameter.
+
+Page Table Management
+=====================
+
+When PTI is enabled, the kernel manages two sets of page
+tables. The first copy is very similar to what would be present
+for a kernel without PTI. This includes a complete mapping of
+userspace that the kernel can use for things like copy_to_user().
+
+The userspace copy is used when running userspace and mirrors the
+mapping of userspace present in the kernel copy. It maps a only
+the kernel data needed to enter and exit the kernel. This data
+is entirely contained in the 'struct cpu_entry_area' structure
+which is placed in the fixmap and thus each CPU's copy of the
+area has a compile-time-fixed virtual address.
+
+For new userspace mappings, the kernel makes the entries in its
+page tables like normal. The only difference is when the kernel
+makes entries in the top (PGD) level. In addition to setting the
+entry in the main kernel PGD, a copy of the entry is made in the
+userspace page tables' PGD.
+
+This sharing at the PGD level also inherently shares all the lower
+layers of the page tables. This leaves a single, shared set of
+userspace page tables to manage. One PTE to lock, one set set of
+accessed bits, dirty bits, etc...
+
+Overhead
+========
+
+Protection against side-channel attacks is important. But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+ a. Each process now needs an order-1 PGD instead of order-0.
+ (Consumes 4k per process).
+ b. The pre-allocated second-level (p4d or pud) kernel page
+ table pages cost ~1MB of additional memory at boot. This
+ is not totally wasted because some of these pages would
+ have been needed eventually for normal kernel page tables
+ and things in the vmalloc() area like vmemmap[].
+ c. The 'cpu_entry_area' structure must be 2MB in size and 2MB
+ aligned so that it can be mapped by setting a single PMD
+ entry. This consumes nearly 2MB of RAM once the kernel
+ is decompressed, but no space in the kernel image itself.
+
+2. Runtime Cost
+ a. CR3 manipulation to switch between the page table copies
+ must be done at interrupt, syscall, and exception entry
+ and exit (it can be skipped when the kernel is interrupted,
+ though.) Moves to CR3 are on the order of a hundred
+ cycles, and are required every at entry and every at exit.
+ b. A "trampoline" must be used for SYSCALL entry. This
+ trampoline depends on a smaller set of resources than the
+ non-PTI SYSCALL entry code, so requires mapping fewer
+ things into the userspace page tables. The downside is
+ that stacks must be switched at entry time.
+ d. Global pages are disabled for all kernel structures not
+ mapped in both to kernel and userspace page tables. This
+ feature of the MMU allows different processes to share TLB
+ entries mapping the kernel. Losing the feature means more
+ TLB misses after a context switch. The actual loss of
+ performance is very small, however, never exceeding 1%.
+ d. Process Context IDentifiers (PCID) is a CPU feature that
+ allows us to skip flushing the entire TLB when switching page
+ tables. This makes switching the page tables (at context
+ switch, or kernel entry/exit) cheaper. But, on systems with
+ PCID support, the context switch code must flush both the user
+ and kernel entries out of the TLB. The user PCID TLB flush is
+ deferred until the exit to userspace, minimizing the cost.
+ e. The userspace page tables must be populated for each new
+ process. Even without PTI, the shared kernel mappings
+ are created by copying top-level (PGD) entries into each
+ new process. But, with PTI, there are now *two* kernel
+ mappings: one in the kernel page tables that maps everything
+ and one for the entry/exit structures. At fork(), we need to
+ copy both.
+ f. In addition to the fork()-time copying, there must also
+ be an update to the userspace PGD any time a set_pgd() is done
+ on a PGD used to map userspace. This ensures that the kernel
+ and userspace copies always map the same userspace
+ memory.
+ g. On systems without PCID support, each CR3 write flushes
+ the entire TLB. That means that each syscall, interrupt
+ or exception flushes the TLB.
+
+Possible Future Work
+====================
+1. We can be more careful about not actually writing to CR3
+ unless its value is actually changed.
+2. Continue to compress the userspace-mapped data to be mapped
+ together. Currently, it is using four entries, but could
+ be further minimized.
+3. Allow PTI to enabled/disabled at runtime in addition to the
+ boot-time switching.
+
+Testing
+========
+
+To test stability of PTI, the following test procedure is recommended,
+ideally doing all of these in parallel:
+
+1. Set CONFIG_DEBUG_ENTRY=y
+2. Run several copies of all of the tools/testing/selftests/x86/ tests
+ (excluding MPX and protection_keys) in a loop on multiple CPUs for
+ several minutes. These tests frequently uncover corner cases in the
+ kernel entry code. In general, old kernels might cause these tests
+ themselves to crash, but they should never crash the kernel.
+3. Run the 'perf' tool in a mode (top or record) that generates many
+ frequent performance monitoring non-maskable interrupts (see "NMI"
+ in /proc/interrupts). This exercises the NMI entry/exit code which
+ is known to trigger bugs in code paths that did not expect to be
+ interrupted, including nested NMIs. Using "-c" boosts the rate of
+ NMIs, and using two -c with separate counters encourages nested NMIs
+ and less deterministic behavior.
+
+ while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
+
+4. Launch a KVM virtual machine.
+
+Debugging
+=========
+
+Bugs in PTI cause a few different signatures of crashes
+that are worth noting here.
+
+ * Failures of the selftests/x86 code. Usually a bug in one of the
+ more obscure corners of entry_64.S
+ * Crashes in early boot, especially around CPU bringup. Bugs
+ in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt. Caused by bugs in entry_64.S,
+ like screwing up a page table switch. Also caused by
+ incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI. The NMI code is separate from main
+ interrupt handlers and can have bugs that do not affect
+ normal interrupts. Also caused by incorrectly mapping NMI
+ code. NMIs that interrupt the entry code must be very
+ careful and can be the cause of crashes that show up when
+ running perf.
+ * Kernel crashes at the first exit to userspace. entry_64.S
+ bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+ in entry_64.S that return to userspace are sometimes separate
+ from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+ faults upon page faults. Caused by touching non-pti-mapped
+ data in the entry code, or forgetting to switch to kernel
+ CR3 before calling into C functions which are not pti-mapped.
+ * Userspace segfaults early in boot, sometimes manifesting
+ as mount(8) failing to mount the rootfs. These have
+ tended to be TLB invalidation issues. Usually invalidating
+ the wrong PCID, or otherwise missing an invalidation.
+
+1. https://gruss.cc/files/kaiser.pdf
_