Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)

From: Borislav Petkov
Date: Sun Nov 26 2017 - 13:52:13 EST


On Fri, Nov 24, 2017 at 06:23:53PM +0100, Ingo Molnar wrote:
> diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
> new file mode 100644

Here are some text cleanups and typo fixes on top, after reading through it:

---
diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
index 46efa3662e22..f2df0441f6ea 100644
--- a/Documentation/x86/kaiser.txt
+++ b/Documentation/x86/kaiser.txt
@@ -4,9 +4,10 @@ Overview
KAISER is a countermeasure against attacks on kernel address
information. There are at least three existing, published,
approaches using the shared user/kernel mapping and hardware features
-to defeat KASLR. One approach referenced in the paper locates the
-kernel by observing differences in page fault timing between
-present-but-inaccessable kernel pages and non-present pages.
+to defeat KASLR. One approach referenced in the paper
+(https://gruss.cc/files/kaiser.pdf) locates the kernel by
+observing differences in page fault timing between
+present-but-inaccessible kernel pages and non-present pages.

When the kernel is entered via syscalls, interrupts or exceptions,
page tables are switched to the full "kernel" copy. When the
@@ -18,37 +19,36 @@ entry/exit functions themselves and the interrupt descriptor
table (IDT).

This helps to ensure that side-channel attacks that leverage the
-paging structures do not function when KAISER is enabled. It can be
-enabled by setting CONFIG_KAISER=y
+paging structures do not function when KAISER is enabled, by setting
+CONFIG_KAISER=y.

Page Table Management
=====================

When KAISER is enabled, the kernel manages two sets of page
tables. The first copy is very similar to what would be present
-for a kernel without KAISER. This includes a complete mapping of
-userspace that the kernel can use for things like copy_to_user().
+for a kernel without KAISER. It includes a complete mapping of
+userspace that the kernel needs for things like copy_*_user().

The second (shadow) is used when running userspace and mirrors the
-mapping of userspace present in the kernel copy. It maps a only
+mapping of userspace present in the kernel copy. It maps only
the kernel data needed to enter and exit the kernel.

The shadow is populated by the kaiser_add_*() functions. Only
-kernel data which has been explicity mapped will appear in the
-shadow copy. These calls are rare at runtime.
+kernel data which has been explicitly mapped will appear in the
+shadow copy. These calls are rare at runtime.

For a new userspace mapping, the kernel makes the entries in its
page tables like normal. The only difference is when the kernel
makes entries in the top (PGD) level. In addition to setting the
-entry in the main kernel PGD, a copy if the entry is made in the
+entry in the main kernel PGD, a copy of the entry is made in the
shadow PGD.

For user space mappings the kernel creates an entry in the kernel
PGD and the same entry in the shadow PGD, so the underlying page
-table to which the PGD entry points is shared down to the PTE
+table to which the PGD entry points to, is shared down to the PTE
level. This leaves a single, shared set of userspace page tables
-to manage. One PTE to lock, one set set of accessed bits, dirty
-bits, etc...
+to manage. One PTE to lock, one set of accessed, dirty bits, etc...

Overhead
========
@@ -76,8 +76,8 @@ this protection comes at a cost:
a. CR3 manipulation to switch between the page table copies
must be done at interrupt, syscall, and exception entry
and exit (it can be skipped when the kernel is interrupted,
- though.) Moves to CR3 are on the order of a hundred
- cycles, and are required every at entry and every at exit.
+ though.) CR3 modifications are in the order of a hundred
+ cycles, and are required at every entry and exit.
b. Task stacks must be mapped/unmapped. We need to walk
and modify the shadow page tables at fork() and exit().
c. Global pages are disabled. This feature of the MMU
@@ -91,7 +91,7 @@ this protection comes at a cost:
systems with PCID support, the context switch code must flush
both the user and kernel entries out of the TLB, with an
INVPCID in addition to the CR3 write. This INVPCID is
- generally slower than a CR3 write, but still on the order of
+ generally slower than a CR3 write, but still in the order of
a hundred cycles.
e. The shadow page tables must be populated for each new
process. Even without KAISER, the shared kernel mappings
@@ -123,7 +123,7 @@ Possible Future Work:
code or userspace since it will not have to reload all of
its TLB entries. However, its upside is limited by PCID
being used.
-4. Allow KAISER to enabled/disabled at runtime so folks can
+4. Allow KAISER to be enabled/disabled at runtime so folks can
run a single kernel image.

Debugging:
@@ -144,7 +144,7 @@ that are worth noting here.
running perf.
* Kernel crashes at the first exit to userspace. entry_64.S
bugs, or failing to map some of the exit code.
- * Crashes at first interrupt that interrupts userspace. The paths
+ * Crashes at the first interrupt that interrupts userspace. The paths
in entry_64.S that return to userspace are sometimes separate
from the ones that return to the kernel.
* Double faults: overflowing the kernel stack because of page
@@ -157,4 +157,3 @@ that are worth noting here.
as mount(8) failing to mount the rootfs. These have
tended to be TLB invalidation issues. Usually invalidating
the wrong PCID, or otherwise missing an invalidation.
-
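
For readers following the "Page Table Management" text touched by the
diff above, here is a toy userspace model of the two-PGD scheme it
describes: userspace entries written to the kernel PGD are mirrored
into the shadow PGD, while kernel entries reach the shadow only when
mapped explicitly. It is a minimal sketch, not the kernel's code. All
names (toy_mm, toy_set_pgd, toy_shadow_add_kernel) and the 512-entry
table with a lower-half userspace split are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout for illustration: 512 top-level slots, lower half = userspace. */
#define TOY_PTRS_PER_PGD     512
#define TOY_USER_PGD_ENTRIES 256

typedef uint64_t toy_pgd_t;

/* Hypothetical per-process state: the full kernel copy plus the shadow copy. */
struct toy_mm {
	toy_pgd_t kernel_pgd[TOY_PTRS_PER_PGD];
	toy_pgd_t shadow_pgd[TOY_PTRS_PER_PGD];
};

/* Nonzero if this top-level slot covers userspace addresses. */
static int toy_is_user_index(size_t idx)
{
	return idx < TOY_USER_PGD_ENTRIES;
}

/*
 * Set a top-level entry. The kernel copy always gets it. A userspace
 * entry is also copied into the shadow, so both PGDs point at the same
 * lower-level tables and there is only one set of PTEs to manage.
 */
static void toy_set_pgd(struct toy_mm *mm, size_t idx, toy_pgd_t val)
{
	mm->kernel_pgd[idx] = val;
	if (toy_is_user_index(idx))
		mm->shadow_pgd[idx] = val;
}

/*
 * Stand-in for the kaiser_add_*() idea: kernel data shows up in the
 * shadow only after being mapped explicitly.
 */
static void toy_shadow_add_kernel(struct toy_mm *mm, size_t idx)
{
	mm->shadow_pgd[idx] = mm->kernel_pgd[idx];
}
```

In this model a kernel-half entry stays absent from the shadow until
toy_shadow_add_kernel() is called for it, which mirrors the statement
that only explicitly mapped kernel data appears in the shadow copy.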

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.