[RFC PATCH v2 0/6] Per process PTI activation

From: Willy Tarreau
Date: Tue Jan 09 2018 - 07:58:33 EST


So here comes the second version after the first round of comments.

As suggested, I dropped the thread_info flag and placed it in the
mm_struct instead. There's now a per_cpu variable that can be checked
in the entry code to decide whether or not to switch CR3.

It's important to note that the new flag is lost upon execve(). I think
that this provides a better guarantee against any accidental use (eg: a
program calling some external helpers once in a while), but it also
means we can't use a wrapper anymore and have to modify the executable.

I still think that a mixed approach could be nice: a specific flag that
is only applied upon the next execve() call and then dropped. For now
I'm not really sure how to do this cleanly.

Regarding the _PAGE_NX change, I didn't touch it for now. I like Andy's
approach of changing it dynamically after the first page fault caused
by the return to userspace. I just don't know how to do that yet.

I've split the entry code changes in two. The first part only updates the
kernel entry code to avoid updating CR3 if it already points to a kernel
PGD. The second one adds the flag check when going back to userspace.
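
As a rough illustration of the two entry-code steps described above, here
is a small userspace model in C. It is only a sketch, not the assembly
from the patches: the function names and the `wrote` out-parameter are
hypothetical, and it assumes the usual PTI layout where the user PGD sits
one page above the kernel PGD, so that bit 12 of CR3 distinguishes them.
PCID handling is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: with PTI the user PGD is the page right after the kernel
 * PGD, so bit 12 of CR3 tells the two apart. PCID bits are not modeled. */
#define MODEL_USER_PGD_BIT (UINT64_C(1) << 12)

/* Kernel entry: CR3 only needs a write if it still points to the user
 * PGD. *wrote reports whether a CR3 write would have been issued. */
static uint64_t model_cr3_on_entry(uint64_t cr3, bool *wrote)
{
	*wrote = (cr3 & MODEL_USER_PGD_BIT) != 0;
	return cr3 & ~MODEL_USER_PGD_BIT;
}

/* Return to userspace: when the per-cpu pti_disable value is set, stay
 * on the kernel PGD and skip the CR3 write entirely. */
static uint64_t model_cr3_on_exit(uint64_t cr3, bool pti_disable, bool *wrote)
{
	*wrote = !pti_disable;
	return pti_disable ? cr3 : (cr3 | MODEL_USER_PGD_BIT);
}
```

In this model a task with pti_disable set never leaves the kernel PGD, so
both the exit-side and the next entry-side CR3 writes are skipped.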

This allowed me to check if the CR3-only changes brought any benefit, but
I failed to detect any improvement with that alone for now, including on
a preempt kernel.

With this patch, when haproxy starts with "arch_prctl(0x1022, 1)", the
performance drop compared to booting with "pti=off" is only ~1% and more
or less within measurement noise.
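
Since the flag is lost upon execve(), the program has to issue the call
itself. A minimal sketch of how that could look from userspace follows.
The request value 0x1022 is taken from the call quoted above and is
assumed to be the ARCH_SET_NOPTI of this series; the macro and function
names are hypothetical. The value is only meaningful on a kernel carrying
these patches; on any other kernel the call is expected to fail or to
mean something unrelated.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 0x1022 is ARCH_SET_NOPTI as proposed in this RFC series.
 * It is not a mainline ABI value. */
#define ARCH_SET_NOPTI_ASSUMED 0x1022

/* Ask the kernel to disable (on = 1) or re-enable (on = 0) PTI for the
 * current process. Returns 0 on success or a negative errno value. */
static int request_nopti(unsigned long on)
{
#ifdef SYS_arch_prctl
	if (syscall(SYS_arch_prctl, ARCH_SET_NOPTI_ASSUMED, on) == 0)
		return 0;
	return -errno;
#else
	(void)on;
	return -ENOSYS;
#endif
}
```

A daemon would call request_nopti(1) early at startup and simply carry on
with PTI enabled if the call returns an error.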

For now I've left the prctl to retrieve the current value as it helped
during debugging, though I think it should disappear before the final
version as it provides very little value.

Here are the numbers I'm seeing in the various situations for a few
tests on a hardware machine (core i7-4790K). Numbers are in connections
per second, with the performance ratio compared to pti=off in
parentheses:
                                  TEST(*)
                 reject        reject+acl      forward
---------------+-------------+---------------+----------------
pti=off          444k (100%)   252k (100%)     83k (100%)
pti=on           382k (86%)    195k (77%)      71k (85%)
pti=on+prctl     439k (99%)    249k (99%)      83k (100%)

*: tests:
"reject" : reject rule, accept(), setsockopt() and close()
"reject+acl" : acl-based rule, does extra syscalls (getsockname(),
getsockopt, 2 setsockopt, recv, shutdown)
"forward" : connection forwarded to remote server, much heavier

It's interesting to note that the rule employing a few more syscalls
without adding much userspace work is obviously more impacted by PTI.
We have a total of 8 syscalls per connection on the middle one and
the difference is significant.
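
To put a rough figure on this, the throughput numbers can be converted
into extra time per syscall. The helper below is only a back-of-envelope
sketch with a hypothetical name. It assumes the figures come from one
saturated core and that the whole slowdown is attributable to the
syscalls, which ignores interrupts and other kernel entries.

```c
/* Extra time per syscall, in nanoseconds, when throughput drops from
 * base_cps to pti_cps connections per second. Assumes all of the added
 * per-connection time is spent in syscalls_per_conn kernel entries. */
static double extra_ns_per_syscall(double base_cps, double pti_cps,
				   unsigned int syscalls_per_conn)
{
	double extra_ns_per_conn = 1e9 / pti_cps - 1e9 / base_cps;

	return extra_ns_per_conn / syscalls_per_conn;
}
```

For the reject+acl test, extra_ns_per_syscall(252e3, 195e3, 8) gives
roughly 145 ns per syscall under these assumptions.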

Willy

Cc: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Brian Gerst <brgerst@xxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Cc: Josh Poimboeuf <jpoimboe@xxxxxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Cc: Kees Cook <keescook@xxxxxxxxxxxx>


Willy Tarreau (6):
  x86/mm: add a pti_disable entry in mm_context_t
  x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to
    enable/disable PTI
  x86/pti: add a per-cpu variable pti_disable
  x86/pti: don't mark the user PGD with _PAGE_NX.
  x86/entry/pti: avoid setting CR3 when it's already correct
  x86/entry/pti: don't switch PGD on when pti_disable is set

 arch/x86/entry/calling.h          | 25 +++++++++++++++++++++++++
 arch/x86/include/asm/mmu.h        |  4 ++++
 arch/x86/include/uapi/asm/prctl.h |  3 +++
 arch/x86/kernel/process_64.c      | 24 ++++++++++++++++++++++++
 arch/x86/mm/pti.c                 |  2 ++
 5 files changed, 58 insertions(+)

--
1.7.12.1