Re: x86: dynamic pt_regs pointer and kernel stack randomization

From: H. Peter Anvin
Date: Mon Apr 29 2024 - 15:10:49 EST


On April 29, 2024 9:09:02 AM PDT, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>On Sun, Apr 28, 2024 at 03:28:44PM -0700, H. Peter Anvin wrote:
>> So the issue of having an explicit pointer to the user space pt_regs
>> structure has come up in the context of future FRED extensions, which may
>> increase the size of the exception stack frame under some circumstances,
>> which may not be constant in the first place.
>>
>> It struck me that this can be combined with kernel stack randomization in
>> such a way that it mostly amortizes the cost of both.
>
>I've tried to go get up to speed with what FRED changes in x86, and IIUC
>it effectively redefines the interrupt handling? Does it also change
>syscall entry? (I got the impression it does not, but the mention of
>syscalls later in your email make me think I've gotten this wrong.)
>
>I think currently kstack randomization isn't applied to interrupts (but
>it should be), so I'm all for finding a way to make this happen.
>
>> Downside: for best performance, it should be pushed into assembly entry/exit
>> code, although that is technically not *required* (and it is of course
>> possible to do it in C on IDT but in the one single assembly entry stub for
>> FRED.)
>>
>> In the FRED code it would look like [simplified]:
>>
>> asm_fred_entrypoint_user:
>> /* Current code */
>> /* ... */
>> movq %rsp,%rdi /* part of FRED_ENTER */
>>
>> /* New code */
>> movq %rdi,PER_CPU_VAR(user_pt_regs_ptr) /* [1] */
>> #ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
>> subq PER_CPU_VAR(kstack_offset),%rsp /* [2] */
>> #endif
>> /* CFI annotation */
>>
>> /* Current code */
>> call fred_entry_from_user
>>
>> asm_fred_exit_uTser:
>> /* New code */
>> movq PER_CPU_VAR(user_pt_regs_ptr),%rsp
>> /* CFI annotation */
>>
>> /* Current code */
>> /* ... */
>> ERETU
>
>If I'm understanding FRED correctly, I think this exit path
>would need to call choose_random_kstack_offset() as well. (For
>syscalls, add_random_kstack_offset() is used on entry and
>choose_random_kstack_offset() is used on exit, so I think we'd want the
>same pattern for interrupt handling.)
>
>> [1] - This instruction can be pushed into the C code in
>> fred_entry_from_user() without changing the functionality in any way.
>>
>> [2] - This is the ONLY instruction in this sequence that would be specific
>> to CONFIG_RANDOMIZE_KSTACK_OFFSET, and it probably isn't even worth patching
>> out.
>>
>> This requires a 64-bit premasked value to be generated byc
>> choose_random_kstack_offset() which would seem to be a better option for
>> performance, especially since there is already arithmetic done at that time.
>> Otherwise it requires three instructions. This means the randomness
>> accumulator ends up being in a separate variable from the premasked value.
>> This could be further very slightly optimized by adding the actual stack
>> location and making this a movq, but then that value has to be
>> context-switched; this is probably not all that useful.
>
>Yeah, having a pre-masked value ready to go seems fine to me. It's
>simply a space trade-off.
>
>> The masking needs to consider alignment, which the current code doesn't;
>> that by itself adds additional code to the current code sequences.
>>
>>
>> That is literally *all* the code that is needed to replace
>> add_random_kstack_offset(). It doesn't block tailcall optimization anywhere.
>> If user_pt_regs_ptr and kstack_offset share a cache line it becomes even
>> cheaper.
>>
>> Note that at least in the FRED case this code would be invoked even on
>> events other than system calls, some of which may be controllable by the
>> user, like page faults. I am guessing that this is actually a plus.
>
>Yeah, I'd like greater coverage for ring 3->0 transitions. We do want to
>double-check the original design choices, though. I *think* they still
>stand (see the comments for choose_random_kstack_offset() about entropy
>location (per-cpu area), and lifetime (across userspace execution).
>
>-Kees
>

FRED replaces ALL exception handling and ring transitions, including system calls and call gates.

You can't enter ring 0 without going through the FRED entry point (and you can't enter rings 1 or 2 at all, officially deprecating them from the architecture.)