RFD: Fastpath amelioration of the KAISER/KPTI performance impact

From: Kalle A. Sandstrom
Date: Thu Jan 04 2018 - 12:55:35 EST



[presented with intent to amuse and edumacate, here's a little something
something for the current performance crisis.]


--- cut here ---

Fastpath amelioration of the KAISER fixes' performance impact in Linux.
Kalle A. Sandström, 20180104

[DRAFT VERSION 0: not for publication. not even for serious consideration; v0
should be read as an elaborate joke.]


ABSTRACT.

This document identifies an opportunity for clawing back some of the
performance penalty from the KAISER/KPTI security patch by means of
fast-pathing interprocess communication in the section of code that'd
otherwise trampoline kernel entry. Two possible designs to this end are
briefly outlined.

The designs presented are for the very worst case where microcode updates
don't appear, or are restricted to new CPU models, and consequently
KAISER/KPTI is here to stay for a hojillion people. All of this may be a
terrible idea. Caveat lector; a good argument can be made in favour of not
looking into the abyss.


SYNOPSIS.

Increase the footprint of the constant function fragment (CFF), i.e. the
trampoline region that forwards kernel entry, so that it handles some forms of
task switching and inter-process communication without entering the kernel
proper, thereby halving the number of TLB flushes over some IPC roundtrips.
The IPC mechanism might be something as involved as a reimplementation of most
POSIX I/O, or as minimal as a rendezvous synchronization primitive combined
with existing shared memory gubbins. Distinguish this from the ``big'' kernel
with filesystems, MM, block devices, and anything with an infinite memory
requirement, all of which is stashed behind the extra TLB flush.

Call the intermediary an ittybittykernel. *rimshot*

While most performance gains from this general approach should happen early
on, higher-hanging fruit will be available for a long time to come, so the CFF
is expected to grow indefinitely. It could be foreseen that there'll be a
long-term game of cat and mouse between the speculative information leak
finder and the perpetually-appointed security engineer, providing both with
long-term careers in computational esoterica.

This design document presents a speculative development path towards such a
Frankenstein's architecture as well as a first step along that path,
ultimately motivated by the prospect of recovering some of that putative 20%
performance penalty. On the downside, even the best result will still be worse
than a hypothetical microkernel system written from scratch, but only until
the CPU manufacturers repair their emissions: after that, monolithic will rule
microbenchmarks once more (on new hardware, and chips where a microcode fix is
available & yields a lesser penalty).


BACKGROUND.

The KAISER patch makes the kernel invulnerable to the speculative address
space probing feature of certain Intel processors (the ``Meltdown''
vulnerability). It accomplishes this at the cost of a TLB flush coming and
going per syscall, which brings the minimum number of flushes over the
shortest possible inter-process roundtrip to 4: two syscalls, two flushes each.
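
The arithmetic behind that figure, and behind the halving claimed in the
synopsis, can be sketched as below. This is only an accounting model: every
name in it is made up for illustration, and it assumes that the shortest
roundtrip is one call plus one reply and that a fastpath would pay one flush
per address space switch.

```c
/* Hypothetical accounting model; none of these names exist in any kernel. */
enum {
    FLUSHES_PER_SYSCALL    = 2, /* one flush going in, one coming out */
    SYSCALLS_PER_ROUNDTRIP = 2, /* assumed: one call, one reply */
    SWITCHES_PER_ROUNDTRIP = 2, /* assumed: caller to peer, peer to caller */
    FLUSHES_PER_SWITCH     = 1  /* assumed: fastpath never enters big kernel */
};

/* Every syscall crosses the flush boundary twice under KAISER/KPTI. */
static unsigned kpti_roundtrip_flushes(void)
{
    return SYSCALLS_PER_ROUNDTRIP * FLUSHES_PER_SYSCALL;
}

/* A fastpath in the CFF would switch address spaces directly. */
static unsigned fastpath_roundtrip_flushes(void)
{
    return SWITCHES_PER_ROUNDTRIP * FLUSHES_PER_SWITCH;
}
```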

This is a heavy performance cost in applications where out-of-process
computation doesn't dominate TLB reload overhead. It could even be said that
in terms of performance, KAISER turns Linux into the worst possible
microkernel system: one where exactly no services are provided by the
intermediate layer but all of a monolithic design's downsides are retained,
leaving the intermediary's introduction a step for the worse from all
perspectives besides security.


PROPOSAL.

Instead of having the kernel mapped into each process and serving syscalls
etc. directly, the KAISER patch changes the kernel interface to an analogue of
what's used in 4G/4G mode. That's to say, it forwards kernel entry via a set
of IDT and syscall trampolines over the TLB flush boundary into what's
effectively a separate kernel address space. The simplified rationale is that
since the region containing the trampolines is small and its contents easily
audited for security issues, this prevents both leakage of useful information
regarding kernel address space layout randomization, and (consequently) the
utilization of speculative kernel information leak vulnerabilities without
(an)other ASLR leak(s).

The proposal at hand amounts to an increase in the footprint of this
``constant function fragment'' to the end that communication between the X
server and its clients wouldn't suffer double TLB flushes. Two distinct means
are proposed: the first is a conservative reimplementation of a subset of
POSIX file descriptor and process management, and UNIX domain sockets; and the
second a simplistic ``shared memory with rendezvous sync'' primitive coupled
with fiddly business in the C library and a legacy fallback.

Regardless of design particulars, the additional code's presence is justified
by being eventually fully auditable for both KASLR information leaks and
exploitable speculative-execution gadgets. That's to say: its security follows
from limited scope and legions of hungry twentysomethings over however many
years it takes.


DESIGN.

[Imagine a convincing argument here about how UNIX domain sockets require much
of UNIX to operate properly, and how that path from one process to the next
isn't gonna fit in a small enough binary to audit properly. It would be
compatible with POSIX I/O though.]

A rendezvous signaling mechanism would identify senders and recipients by
cookie tied to a thread identity and an address space defined by big-kernel
memory management, rendering it able to switch processes and drop to the big
kernel for scheduling, but little else. Its requirements are first, that user
contexts involved in rendezvous signaling be available to CFF space; second,
that the CFF is able to switch between address spaces without dropping to the
big kernel; third, that communication security is managed by the big kernel in
the form of ahead-of-time checking; and fourth, that a magical unicorn will
take care of multiprocessing details. The rest is left as an exercise for the
reader.
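
For the reader who'd rather not do the exercise, here is a minimal sketch of
the first three requirements, written as ordinary C rather than kernel code.
All of it is assumption: the struct layout, the names, the linear table, and
the idea that the big kernel records one pre-approved peer per endpoint are
invented for illustration. The unicorn's share (multiprocessing) is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Everything here is hypothetical and only illustrates the requirements. */
enum cff_action {
    CFF_SWITCH_TO_PEER,     /* switch address space without leaving the CFF */
    CFF_DROP_TO_BIG_KERNEL  /* anything else: scheduling, errors, the lot */
};

struct cff_endpoint {
    uint64_t cookie;        /* names the endpoint to its peers */
    uint32_t thread_id;     /* thread identity the cookie is tied to */
    uint64_t aspace_root;   /* address space handle from big-kernel MM */
    uint64_t allowed_peer;  /* sender cookie approved ahead of time */
    int      waiting;       /* nonzero while blocked in receive */
};

/* Requirement 1: user contexts are visible to the CFF (here, a plain table). */
static const struct cff_endpoint *
cff_find(const struct cff_endpoint *tab, size_t n, uint64_t cookie)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].cookie == cookie)
            return &tab[i];
    return NULL;
}

/* Requirements 2 and 3: switch directly if, and only if, the big kernel
 * approved this pairing beforehand and the receiver is ready for it. */
static enum cff_action
cff_rendezvous(const struct cff_endpoint *tab, size_t n,
               uint64_t sender, uint64_t receiver, uint64_t *next_aspace)
{
    const struct cff_endpoint *to = cff_find(tab, n, receiver);

    if (to == NULL || to->allowed_peer != sender || !to->waiting)
        return CFF_DROP_TO_BIG_KERNEL;
    *next_aspace = to->aspace_root;
    return CFF_SWITCH_TO_PEER;
}
```

The point of the shape is that the fastpath makes no policy decisions of its
own: it compares against what the big kernel wrote down earlier, and punts on
everything else.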

On the downside, a thin mechanism like that would require reimplementation of
UNIX domain sockets in some compatible way, mainly in userspace, and under
similar security demands if it's not to be worse in that regard than the
alternative. It's also guaranteed not to interoperate with any select(2)-like
syscall due to its separation from POSIX; consequently threads receiving data
via that mechanism will only be able to pass it to threads involved in POSIX
I/O using heavyweight syscalls. There's also a bunch of interrupt, timeout,
lifecycle, etc. semantics that're going unexplored here.

It may be that one of these mechanisms offers more benefit than the other to
various pre-flip IPI etc. handlers in the CFF.


IMPLEMENTATION (USER SPACE).

Legacy static binaries, and those without a recent C library, will continue to
use the syscall interface, dropping into the big kernel as a matter of course.
New and compatible binaries will link to a C library which implements some
cases of POSIX I/O in terms of the fastpath IPC mechanism by means of unicorn
farts & kitten giggles.
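
Minus the unicorn, the fallback arrangement might look something like the
sketch below. fastpath_send() and libc_write() are hypothetical names; no such
entry point exists, so the stub always reports ENOSYS and the wrapper always
lands in the ordinary write(2) path. Only write(2) itself is real.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the fastpath entry point, which does not exist.
 * It always reports "not available" so that the fallback gets exercised. */
static ssize_t fastpath_send(int fd, const void *buf, size_t len)
{
    (void)fd;
    (void)buf;
    (void)len;
    errno = ENOSYS;
    return -1;
}

/* Hypothetical libc-internal write(): fastpath first, legacy syscall second. */
static ssize_t libc_write(int fd, const void *buf, size_t len)
{
    ssize_t n = fastpath_send(fd, buf, len);

    if (n >= 0 || errno != ENOSYS)
        return n;               /* fastpath handled it, or failed for real */
    return write(fd, buf, len); /* legacy path through the big kernel */
}
```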


IMPLEMENTATION (KERNEL).

[Imagine a fancy data structure for storing IPC endpoint cookies and so forth
without exposing their contents through a Meltdown gadget.]

[Imagine elaborate whatsits for sharing some pages between the CFF and the big
kernel for e.g. user context management, inter-processor interactions, and so
forth.]

[Imagine the legwork it takes to get the job done. phew! what a slog.]


TESTING.

Just run whatever old bollocks on top, see how it breaks. The usual.


CONCLUSION.

Of course I'm having a giggle; effluent just became actual! But for how long?

--- cut here ---

-KS