RE: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon syscall

From: Reshetova, Elena
Date: Wed Mar 20 2019 - 08:04:25 EST


Something is really weird with my Intel mail: it only now delivered
all the messages to me in one go, and I was thinking I wasn't getting any feedback...

> > If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> > the kernel stack offset is randomized upon each
> > entry to a system call, after the fixed location of the
> > pt_regs struct.
> >
> > This feature is based on the original idea from
> > PaX's RANDKSTACK feature:
> > https://pax.grsecurity.net/docs/randkstack.txt
> > All credit for the original idea goes to the PaX team.
> > However, the design and implementation of
> > RANDOMIZE_KSTACK_OFFSET differ greatly from the RANDKSTACK
> > feature (see below).
> >
> > Reasoning for the feature:
> >
> > This feature aims to make various stack-based attacks
> > that rely on a deterministic stack structure considerably
> > harder.
> > We have had many such attacks in the past [1], [2], [3]
> > (just to name a few), and as Linux kernel stack protections
> > have been constantly improving (vmap-based stack
> > allocation with guard pages, removal of thread_info,
> > STACKLEAK), attackers have to find new ways to make
> > their exploits work.
> >
> > It is important to note that we currently cannot show
> > a concrete attack that would be stopped by this new
> > feature (given that other existing stack protections
> > are enabled), so this is an attempt to be on the proactive
> > side rather than catching up with existing successful exploits.
> >
> > The main idea is that since the stack offset is
> > randomized upon each system call, it is very hard for
> > an attacker to reliably land in any particular place on
> > the thread stack when the attack is performed.
> > Also, since randomization is performed *after* pt_regs,
> > the ptrace-based approach of discovering the randomization
> > offset during a long-running syscall should not be
> > possible.
> >
> > [1] jon.oberheide.org/files/infiltrate12-thestackisback.pdf
> > [2] jon.oberheide.org/files/stackjacking-infiltrate11.pdf
> > [3] googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
> >
> > Design description:
> >
> > During most of the kernel's execution, it runs on the "thread
> > stack", which is allocated in fork.c/dup_task_struct() and stored in
> > a per-task variable (tsk->stack). Since the stack grows downward,
> > the stack top can always be calculated using the task_top_of_stack(tsk)
> > function, which essentially returns the address of tsk->stack + stack
> > size. When VMAP_STACK is enabled, the thread stack is allocated from
> > vmalloc space.
> >
> > The thread stack is pretty deterministic in its structure: it is fixed
> > in size, and upon every syscall entry from userspace to the kernel,
> > construction of the thread stack starts from an address fetched
> > from the per-cpu cpu_current_top_of_stack variable.
> > The first element pushed to the thread stack is the pt_regs struct,
> > which stores all required CPU registers and syscall parameters.
> >
> > The goal of the RANDOMIZE_KSTACK_OFFSET feature is to add a random
> > offset between the pt_regs that has been pushed to the stack and the
> > rest of the thread stack (used during syscall processing), every time
> > a process issues a syscall. The source of randomness can be taken
> > either from rdtsc or rdrand, with the performance implications listed
> > below. The value of the random offset is stored in a callee-saved
> > register (currently r15), and the maximum size of the random offset
> > is defined by __MAX_STACK_RANDOM_OFFSET, which currently equals 0xFF0.
> >
> > As a result, this patch introduces 8 bits of randomness
> > (bits 4-11 are randomized; bits 0-3 must be zero due to stack alignment)
> > after the pt_regs location on the thread stack.
> > The amount of randomness can be adjusted based on how much
> > stack space we wish to (or can) trade for security.
>
> Why do you need four zero bits at the bottom? x86_64 Linux only
> maintains 8 byte stack alignment.

I have to check this: it looked to me like this was needed to avoid
alignment issues, but maybe that is my mistake.
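
To make the alignment question concrete, what the patch effectively computes
is something like the following (a sketch only, not the literal patch code;
rdtsc() here just stands for whichever randomness source ends up being used):

unsigned long offset;

/*
 * Sketch: take the low bits of the randomness source and clear bits 0-3,
 * so the offset is 16-byte aligned and capped at __MAX_STACK_RANDOM_OFFSET
 * (0xFF0). If 8-byte alignment is really enough on x86_64, the mask could
 * in principle be relaxed to 0xFF8.
 */
offset = rdtsc() & 0xFF0;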

> >
> > The main issue with this approach is that it slightly breaks the
> > processing of the last frame in the unwinder, so I have made a simple
> > fix to the frame pointer unwinder (I guess the others should be fixed
> > similarly) and to the stack dump functionality so that they "jump" over
> > the random hole at the end. My way of solving this is probably far from
> > ideal, so I would really appreciate feedback on how to improve it.
>
> That's probably a question for Josh :)
>
> Another way to do the dirty work would be to do:
>
> char *ptr = alloca(offset);
> asm volatile ("" :: "m" (*ptr));
>
> in do_syscall_64() and adjust compiler flags as needed to avoid warnings. Hmm.

I was hoping to get away with assembly-only and minimal
changes, but if this approach seems better to you and Josh,
then I guess I can do it this way.
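
If I read the suggestion right, it would amount to roughly the following in
do_syscall_64() (a sketch only; get_random_offset() is a placeholder for
whatever randomness source we pick, and in kernel code alloca() would
presumably be __builtin_alloca()):

__visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
	/*
	 * Sketch of the alloca() approach: move the stack pointer down by
	 * a random, 16-byte-aligned amount before the syscall body runs.
	 * The empty asm with an "m" constraint keeps the compiler from
	 * optimizing the alloca() away.  get_random_offset() is a
	 * placeholder, not an existing kernel function.
	 */
	unsigned long offset = get_random_offset() & 0xFF0;
	char *ptr = __builtin_alloca(offset);

	asm volatile ("" :: "m" (*ptr));

	/* ... existing syscall dispatch goes here ... */
}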

>
> >
> > Performance:
> >
> > 1) lmbench: ./lat_syscall -N 1000000 null
> > base: Simple syscall: 0.1774 microseconds
> > random_offset (rdtsc): Simple syscall: 0.1803 microseconds
> > random_offset (rdrand): Simple syscall: 0.3702 microseconds
> >
> > 2) Andy's tests, misc-tests: ./timing_test_64 10M sys_enosys
> > base: 10000000 loops in 1.62224s = 162.22 nsec / loop
> > random_offset (rdtsc): 10000000 loops in 1.64660s = 164.66 nsec / loop
> > random_offset (rdrand): 10000000 loops in 3.51315s = 351.32 nsec / loop
> >
>
> Egads! RDTSC is nice and fast but probably fairly easy to defeat.
> RDRAND is awful. I had hoped for better.

Yes, it is very, very slow. I didn't believe my measurements at first,
thinking that it cannot be so much slower just because of a one-instruction
difference, but it looks like it can...

>
> So perhaps we need a little percpu buffer that collects 64 bits of
> randomness at a time, shifts out the needed bits, and refills the
> buffer when we run out.

Hm... We might have to refill pretty often on syscall-hungry
workloads. If we need 8 bits for each syscall, then we will refill
every 8 syscalls, which is of course better than every one, but is
that an acceptable penalty? And then there is also the question of
where to store our offset bits, as Kees mentioned.
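
Just so we are talking about the same thing, I understand the per-cpu buffer
roughly like this (a sketch; the names are made up, and preemption handling
is glossed over):

#include <linux/percpu.h>
#include <linux/random.h>

/* Illustrative names, not from the patch. */
struct kstack_rand_state {
	u64 bits;	/* cached random bits */
	u8  avail;	/* how many of those bits are still unused */
};

static DEFINE_PER_CPU(struct kstack_rand_state, kstack_rand_state);

static unsigned long random_kstack_offset(void)
{
	struct kstack_rand_state *s = this_cpu_ptr(&kstack_rand_state);
	unsigned long offset;

	/* Refill from the CSPRNG once every eight syscalls. */
	if (s->avail < 8) {
		s->bits  = get_random_u64();
		s->avail = 64;
	}

	offset = s->bits & 0xff;
	s->bits >>= 8;
	s->avail -= 8;

	/* Shift into bits 4-11, keeping bits 0-3 zero for alignment. */
	return offset << 4;
}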

>
> > /*
> > * This does 'call enter_from_user_mode' unless we can avoid it based on
> > * kernel config or using the static jump infrastructure.
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 1f0efdb7b629..0816ec680c21 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -167,13 +167,19 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
> >
> > PUSH_AND_CLEAR_REGS rax=$-ENOSYS
> >
> > + RANDOMIZE_KSTACK /* stores randomized offset in r15 */
> > +
> > TRACE_IRQS_OFF
> >
> > /* IRQs are off. */
> > movq %rax, %rdi
> > movq %rsp, %rsi
> > + sub %r15, %rsp /* subtract random offset from rsp */
> > call do_syscall_64 /* returns with IRQs disabled */
> >
> > + /* need to restore the gap */
> > + add %r15, %rsp /* add random offset back to rsp */
>
> Off the top of my head, the nicer way to approach this would be to
> change this such that mov %rbp, %rsp; popq %rbp or something like that
> will do the trick. Then the unwinder could just see it as a regular
> frame. Maybe Josh will have a better idea.

I tried it with rbp, but I could not get it working the way it does with
other callee-saved registers. But since the alloca() method seems
preferable, maybe it is not worth investigating this further.

Best Regards,
Elena.