Re: [RFC PATCH v2 6/6] x86/entry/pti: don't switch PGD on when pti_disable is set

From: Dave Hansen
Date: Thu Jan 11 2018 - 10:29:40 EST


On 01/10/2018 10:42 PM, Willy Tarreau wrote:
> On Wed, Jan 10, 2018 at 11:50:46AM -0800, Linus Torvalds wrote:
>> And the whole "NOW" vs "NEXT" is complete garbage. The obvious sane
>> no-PTI interface is that it
>>
>> (a) inherits on fork/exec, so that you don't have to worry about how
>> something is implemented (think "I want to run this kernel build
>> without the PTI overhead", but also "I want to run this system daemon
>> without PTI").
>>
>> (b) actual domain changes clear it (ie suid, whatever).
>>
>> that make it useful for random uses of "I trust service XYZ".
> OK. Do you want to see something *only* based on a wrapper (i.e. works
> only after execve) or can we let the application apply the change to
> itself ? I would also like to let applications re-enable the protection
> for processes they're going to exec and not necessarily trust.

I don't think we need both a "NOW" and a "NEXT" mode, at least initially.
The "NEXT" semantics are going to be tricky and I think "NOW" is good enough.

Whatever we do, we'll need this PTI-disable flag to be able to cross
execve() so that a wrapper a la nice(1) works. Initially, I think the
default should be that it survives fork(). There are just too many
things out there that "start up" by doing a shell script that calls a
python script, that calls a...

Without the wrapper support, we're _basically_ stuck using this only in
newly-compiled binaries. That's going to make it much less likely to
get used.

The inheritance also gives an app a way to re-enable protections for
children, just from a _second_ wrapper. That's nice because it means we
don't initially need a "NEXT" ABI.

So, I'd do this:
1. Do the arch_prctl() (but ask the ARM guys what they want too)
2. Enabled for an entire process (not thread)
3. Inherited across fork/exec
4. Cleared on setuid() and friends
5. I'm sure the security folks have/want a way to force it on forever
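
To make the semantics in the list above concrete, here is a toy userspace
model of the flag. It is only a sketch: every name in it (struct proc_model,
model_set_pti_disabled() and so on) is made up for illustration, nothing
here is kernel code, and the "forced on" knob is just one guess at what
item 5 could look like.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
struct proc_model {
	bool pti_disabled;  /* item 2: one flag per process, not per thread */
	bool pti_forced_on; /* item 5 (assumed shape): flag can never be set */
};

/* Item 1: stand-in for the proposed arch_prctl(). Returns 0 or -1. */
int model_set_pti_disabled(struct proc_model *p, bool disable)
{
	if (disable && p->pti_forced_on)
		return -1;
	p->pti_disabled = disable;
	return 0;
}

/* Item 3: fork() hands the child a copy of the parent's state. */
struct proc_model model_fork(const struct proc_model *parent)
{
	return *parent;
}

/* Item 3: a plain exec leaves the flag untouched. */
void model_exec(struct proc_model *p)
{
	(void)p;
}

/* Item 4: setuid() and friends clear the flag. */
void model_setuid(struct proc_model *p)
{
	p->pti_disabled = false;
}

/* Wrapper -> shell script -> python script, then a second wrapper. */
void model_wrapper_chain_demo(void)
{
	struct proc_model wrapper = { false, false };
	assert(model_set_pti_disabled(&wrapper, true) == 0);

	struct proc_model shell = model_fork(&wrapper);
	model_exec(&shell);
	struct proc_model python = model_fork(&shell);
	model_exec(&python);
	assert(python.pti_disabled);        /* survived fork/exec twice */

	/* A second wrapper re-enables protection for an untrusted child. */
	struct proc_model child = model_fork(&python);
	assert(model_set_pti_disabled(&child, false) == 0);
	model_exec(&child);
	assert(!child.pti_disabled);
	assert(python.pti_disabled);        /* parent is unaffected */

	model_setuid(&python);
	assert(!python.pti_disabled);       /* cleared by setuid() */
}
```

The second-wrapper step is what lets us get away without a "NEXT" ABI: the
child clears the flag on itself after fork() and before exec.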

Next, if we decide that we have things that both don't want PTI's
protections and are forking things not covered by #4, we can add some
"child opt out" in the prctl(), plus maybe marking binaries somehow.

Please don't forget to add ways to tell if this feature is on/off in
/proc or whatever. I think we also need to be able to dump the actual
CR3 value that we entered the kernel with before we start doing too much
other funky stuff with the entry code.
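
As a rough idea of what the /proc side could look like from userspace, here
is a sketch of a helper that scans /proc/<pid>/status-style text for the
flag. The "PTI_disabled:" field name is purely hypothetical; no such field
exists today and the real name and location are up for discussion.

```c
#include <string.h>

/* Hypothetical field name, not an existing /proc/<pid>/status entry. */
#define PTI_FIELD "PTI_disabled:"

/*
 * Scan status-style text ("Key:\tvalue\n" lines) for the hypothetical field.
 * Returns 1 if PTI is disabled, 0 if it is not, -1 if the field is missing
 * or its value is not 0/1.
 */
int parse_pti_disabled(const char *status_text)
{
	const size_t flen = strlen(PTI_FIELD);
	const char *line = status_text;

	while (line && *line) {
		if (strncmp(line, PTI_FIELD, flen) == 0) {
			const char *v = line + flen;

			/* Skip the whitespace between key and value. */
			while (*v == ' ' || *v == '\t')
				v++;
			if (*v == '1')
				return 1;
			if (*v == '0')
				return 0;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```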