[RFC PATCH 0/5] x86: dynamic indirect call promotion

From: Nadav Amit
Date: Wed Oct 17 2018 - 20:56:35 EST


This RFC introduces indirect call promotion at runtime, which for the
sake of simplicity (and branding) will be called here "relpolines"
(relative call + trampoline). Relpolines are mainly intended as a way
of reducing the overhead of the retpolines used to mitigate Spectre v2.

Unlike indirect call promotion through profile-guided optimization, the
proposed approach does not require a profiling stage, works well with
modules whose addresses are unknown at build time, and can adapt to
changing workloads.

The main idea is simple: for every indirect call, we inject a piece of
code with a fast-path and a slow-path call. The fast path is used if the
target matches the expected (hot) target; the slow path uses a
retpoline. During training, the slow path is set to call a function that
records the call source and target in a hash table and keeps count of
the call frequency. The most common target is then patched into the fast
path as the hot target.
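
To make the idea concrete, here is a minimal userspace C sketch of the
concept. It is not code from this series: each call site caches one hot
target plus trivial miss statistics; hits take a direct call, misses
take a plain indirect call (which in the kernel would be the retpoline)
and feed a toy training policy. All names (relp_site, relp_call, the
16-miss threshold) are made up for illustration.

#include <stdio.h>

typedef int (*op_t)(int);

struct relp_site {
	op_t hot;		/* current hot (promoted) target */
	op_t candidate;		/* most recently seen miss target */
	unsigned long misses;	/* how often the candidate was seen */
};

static int relp_call(struct relp_site *site, op_t target, int arg)
{
	if (target == site->hot)
		return site->hot(arg);	/* fast path: direct call */

	/*
	 * Slow path: in the kernel this would be the retpoline; here it
	 * is a plain indirect call plus a toy training step.
	 */
	if (target == site->candidate && ++site->misses > 16) {
		site->hot = target;	/* promote the common target */
		site->misses = 0;
	} else if (target != site->candidate) {
		site->candidate = target;
		site->misses = 1;
	}
	return target(arg);
}

static int add_one(int x) { return x + 1; }
static int add_two(int x) { return x + 2; }

int main(void)
{
	struct relp_site site = { .hot = add_one };
	int i, v = 0;

	for (i = 0; i < 100; i++)
		v = relp_call(&site, add_two, v);	/* trains, then promotes */

	printf("%d (hot == add_two: %d)\n", v, site.hot == add_two);
	return 0;
}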

The patching is done on-the-fly by patching the conditional branch
(opcode and offset) that is used to compare the target to the hot
target. This makes it possible to direct all cores to the fast path
while patching the slow path, and vice versa. Patching follows two more
rules: (1) only patch a single byte when the code might be executed by
any core; (2) when patching more than one byte, ensure that no core can
run the to-be-patched code, by preventing this code from being preempted
and by using synchronize_sched() after patching the branch that jumps
over this code.
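
As an illustration of that ordering, here is a hedged sketch, not the
code in these patches. It only uses primitives that already exist in the
kernel -- text_poke() and synchronize_sched() -- while struct relpoline
and its fields are made up for illustration, and locking (e.g.
text_mutex) is omitted.

#include <linux/types.h>
#include <linux/rcupdate.h>
#include <asm/text-patching.h>

struct relpoline {
	void *branch;		/* the one-byte conditional-jump opcode */
	void *body;		/* the multi-byte compare/call sequence */
	const void *new_body;
	size_t new_body_len;
};

static void relpoline_repatch(struct relpoline *rp, u8 divert_opcode)
{
	/*
	 * 1) Single-byte patch of the conditional branch: atomic for any
	 *    core that may be executing the call site, and it steers all
	 *    cores away from the code that is about to change.
	 */
	text_poke(rp->branch, &divert_opcode, 1);

	/*
	 * 2) The diverted-from code runs with preemption disabled, so
	 *    once every core has passed through a quiescent state, none
	 *    of them can still be inside it.
	 */
	synchronize_sched();

	/* 3) Now the multi-byte part can be rewritten safely. */
	text_poke(rp->body, rp->new_body, rp->new_body_len);
}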

Changing all the indirect calls to use relpolines is done using assembly
macro magic. There are alternative solutions, but this one is
relatively simple and transparent. There is also logic to retrain the
software predictor, but the policy it uses may need to be refined.
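
For reference, the general shape such a macro could take (illustration
only, not the macro from this series): relpoline_stub and the local
labels are placeholders, the compare immediate and the direct-call
target are what would be patched at runtime, and
__x86_indirect_thunk_<reg> is the existing retpoline thunk.

.macro RELPOLINE_CALL reg:req
	cmpq	$0, %\reg			# patched: the learned hot target
	jnz	.Lslow_\@			# patched byte: fast vs. slow path
	call	relpoline_stub			# fast path: direct call (patched)
	jmp	.Ldone_\@
.Lslow_\@:
	call	__x86_indirect_thunk_\reg	# slow path: retpoline
.Ldone_\@:
.endm

# usage: RELPOLINE_CALL rax   instead of   call *%rax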

In the end, the results are not bad (2-vCPU VM, throughput reported):

             base     relpoline
             ----     ---------
nginx        22898    25178 (+10%)
redis-ycsb   24523    25486 (+4%)
dbench        2144     2103 (+2%)

When retpolines are disabled and retraining is off, the performance
benefit is up to 2% (nginx), but is much less impressive.

There are several open issues: retraining should be done when modules
are removed; CPU hotplug is not supported; x86-32 is probably broken;
and the Makefile does not rebuild when the relpoline code is changed.
Having said that, I am worried that some of the approaches I took would
challenge the new code-of-conduct, so I thought of getting some feedback
before putting more effort into it.

Nadav Amit (5):
x86: introduce preemption disable prefix
x86: patch indirect branch promotion
x86: interface for accessing indirect branch locations
x86: learning and patching indirect branch targets
x86: relpoline: disabling interface

arch/x86/entry/entry_64.S | 10 +
arch/x86/include/asm/nospec-branch.h | 158 +++++
arch/x86/include/asm/sections.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/asm-offsets.c | 6 +
arch/x86/kernel/macros.S | 1 +
arch/x86/kernel/nospec-branch.c | 899 +++++++++++++++++++++++++++
arch/x86/kernel/vmlinux.lds.S | 7 +
arch/x86/lib/retpoline.S | 75 +++
include/linux/module.h | 5 +
kernel/module.c | 8 +
kernel/seccomp.c | 2 +
12 files changed, 1174 insertions(+)
create mode 100644 arch/x86/kernel/nospec-branch.c

--
2.17.1