Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall
From: Andy Lutomirski
Date: Mon Mar 18 2019 - 16:16:07 EST
On Mon, Mar 18, 2019 at 2:41 AM Elena Reshetova
<elena.reshetova@xxxxxxxxx> wrote:
>
> If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> the kernel stack offset is randomized upon each
> entry to a system call, after the fixed location of
> the pt_regs struct.
>
> This feature is based on the original idea from
> the PaX's RANDKSTACK feature:
> https://pax.grsecurity.net/docs/randkstack.txt
> All credit for the original idea goes to the PaX team.
> However, the design and implementation of
> RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK
> feature (see below).
>
> Reasoning for the feature:
>
> This feature aims to make considerably harder the various
> stack-based attacks that rely on a deterministic stack
> structure.
> We have seen many such attacks in the past [1], [2], [3]
> (to name just a few), and as Linux kernel stack protections
> have been steadily improving (vmap-based stack
> allocation with guard pages, removal of thread_info,
> STACKLEAK), attackers have to find new ways for their
> exploits to work.
>
> It is important to note that we currently cannot show
> a concrete attack that would be stopped by this new
> feature (given that other existing stack protections
> are enabled), so this is an attempt to be proactive
> rather than to catch up with existing successful exploits.
>
> The main idea is that since the stack offset is
> randomized upon each system call, it is very hard for
> an attacker to reliably land in any particular place on
> the thread stack when an attack is performed.
> Also, since the randomization is performed *after* pt_regs,
> a ptrace-based approach to discover the randomization
> offset during a long-running syscall should not be
> possible.
>
> [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> [3] googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
>
> Design description:
>
> During most of the kernel's execution, it runs on the "thread
> stack", which is allocated in fork.c/dup_task_struct() and stored in
> a per-task variable (tsk->stack). Since the stack grows downward,
> the stack top can always be calculated using the task_top_of_stack(tsk)
> function, which essentially returns the address tsk->stack + stack
> size. When VMAP_STACK is enabled, the thread stack is allocated from
> vmalloc space.
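(For reference, on x86_64 that calculation boils down to roughly the
following; illustrative only, not the exact kernel definition:

    /* The stack grows down from the top of the allocation. */
    unsigned long top = (unsigned long)tsk->stack + THREAD_SIZE;

where THREAD_SIZE is the fixed size of the thread stack.)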
>
> The thread stack is fairly deterministic in its structure - fixed in size,
> and upon every syscall entry from userspace to the kernel,
> the thread stack starts being constructed from an
> address fetched from the per-cpu cpu_current_top_of_stack variable.
> The first element to be pushed to the thread stack is the pt_regs struct,
> which stores all required CPU registers and syscall parameters.
>
> The goal of the RANDOMIZE_KSTACK_OFFSET feature is to insert a random
> offset between the pt_regs that has been pushed to the stack and the
> rest of the thread stack (used during syscall processing) every time a
> process issues a syscall. The source of randomness can be either rdtsc
> or rdrand, with the performance implications listed below. The value of
> the random offset is stored in a callee-saved register (currently r15),
> and the maximum size of the random offset is defined by the
> __MAX_STACK_RANDOM_OFFSET value, which currently equals 0xFF0.
>
> As a result, this patch introduces 8 bits of randomness
> (bits 4-11 are randomized; bits 0-3 must be zero due to stack alignment)
> after the pt_regs location on the thread stack.
> The amount of randomness can be adjusted based on how much
> stack space we wish to, or can, trade for security.
Why do you need four zero bits at the bottom? x86_64 Linux only
maintains 8-byte stack alignment.
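If only 8-byte alignment is needed, a wider mask would buy an extra
bit of randomness from the same range. Hypothetical sketch, not what
the patch currently does:

    /* Keep bits 3-11 instead of 4-11: 9 bits of randomness,
     * still 8-byte aligned. */
    unsigned long offset = rdtsc() & 0xFF8;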
>
> The main issue with this approach is that it slightly breaks the
> processing of the last frame in the unwinder, so I have made a simple
> fix to the frame pointer unwinder (I guess the others should be fixed
> similarly) and to the stack dump functionality to "jump" over the
> random hole at the end. My way of solving this is probably far from
> ideal, so I would really appreciate feedback on how to improve it.
That's probably a question for Josh :)
Another way to do the dirty work would be to do:

    char *ptr = alloca(offset);
    asm volatile ("" :: "m" (*ptr));

in do_syscall_64() and adjust compiler flags as needed to avoid warnings. Hmm.
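Fleshed out, that might look something like the following. This is an
untested sketch: the rdtsc-based randomness and the 0xFF8 mask are
placeholders for whatever the patch settles on.

    __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
    {
            /* Random, 8-byte-aligned offset below pt_regs. */
            unsigned long offset = rdtsc() & 0xFF8;

            /* Move the stack pointer down by 'offset' bytes. */
            char *ptr = __builtin_alloca(offset);

            /* Keep the compiler from optimizing the hole away. */
            asm volatile ("" :: "m" (*ptr));

            /* ... regular syscall dispatch goes here ... */
    }

The nice part is that the compiler then emits an ordinary stack frame
for do_syscall_64(), so the unwinder shouldn't need any special cases.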
>
> Performance:
>
> 1) lmbench: ./lat_syscall -N 1000000 null
> base: Simple syscall: 0.1774 microseconds
> random_offset (rdtsc): Simple syscall: 0.1803 microseconds
> random_offset (rdrand): Simple syscall: 0.3702 microseconds
>
> 2) Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> random_offset (rdtsc): 10000000 loops in 1.64660s = 164.66 nsec / loop
> random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
>
Egads! RDTSC is nice and fast but probably fairly easy to defeat.
RDRAND is awful. I had hoped for better.
So perhaps we need a little percpu buffer that collects 64 bits of
randomness at a time, shifts out the needed bits, and refills the
buffer when we run out.
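Something like this, maybe. Untested, the names are made up, and it
assumes we're running with preemption off (as we are on syscall entry):

    #include <linux/percpu.h>
    #include <linux/random.h>

    static DEFINE_PER_CPU(u64, kstack_rand_buf);
    static DEFINE_PER_CPU(u8, kstack_rand_bits);

    static unsigned long kstack_random_offset(void)
    {
            u64 buf = this_cpu_read(kstack_rand_buf);
            u8 bits = this_cpu_read(kstack_rand_bits);
            unsigned long ret;

            if (bits < 8) {
                    /* Out of bits: refill the buffer.  This is the
                     * only slow path, taken once per 8 syscalls. */
                    buf = get_random_u64();
                    bits = 64;
            }

            /* Shift out 8 bits; keep the rest for next time. */
            ret = buf & 0xFF;
            this_cpu_write(kstack_rand_buf, buf >> 8);
            this_cpu_write(kstack_rand_bits, bits - 8);

            /* Scale to an 8-byte-aligned offset (0 .. 0x7F8). */
            return ret << 3;
    }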
> /*
> * This does 'call enter_from_user_mode' unless we can avoid it based on
> * kernel config or using the static jump infrastructure.
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 1f0efdb7b629..0816ec680c21 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -167,13 +167,19 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>
> PUSH_AND_CLEAR_REGS rax=$-ENOSYS
>
> + RANDOMIZE_KSTACK /* stores randomized offset in r15 */
> +
> TRACE_IRQS_OFF
>
> /* IRQs are off. */
> movq %rax, %rdi
> movq %rsp, %rsi
> + sub %r15, %rsp /* subtract random offset from rsp */
> call do_syscall_64 /* returns with IRQs disabled */
>
> + /* need to restore the gap */
> + add %r15, %rsp /* add random offset back to rsp */
Off the top of my head, the nicer way to approach this would be to
change this such that mov %rbp, %rsp; popq %rbp or something like that
will do the trick. Then the unwinder could just see it as a regular
frame. Maybe Josh will have a better idea.