Re: [kernel-hardening] Re: [RFC PATCH 6/6] arm64: add VMAP_STACK and detect out-of-bounds SP

From: Ard Biesheuvel
Date: Fri Jul 14 2017 - 08:27:50 EST

On 14 July 2017 at 11:48, Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> wrote:
> On 14 July 2017 at 11:32, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>> On Thu, Jul 13, 2017 at 07:28:48PM +0100, Ard Biesheuvel wrote:
>>> On 13 July 2017 at 18:55, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>>> > On Thu, Jul 13, 2017 at 05:10:50PM +0100, Mark Rutland wrote:
>>> >> On Thu, Jul 13, 2017 at 12:49:48PM +0100, Ard Biesheuvel wrote:
>>> >> > On 13 July 2017 at 11:49, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>>> >> > > On Thu, Jul 13, 2017 at 07:58:50AM +0100, Ard Biesheuvel wrote:
>>> >> > >> On 12 July 2017 at 23:33, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>>> >
>>> >> > Given that the very first stp in kernel_entry will fault if we have
>>> >> > less than S_FRAME_SIZE bytes of stack left, I think we should check
>>> >> > that we have at least that much space available.
>>> >>
>>> >> I was going to reply saying that I didn't agree, but in writing up
>>> >> examples, I mostly convinced myself that this is the right thing to do.
>>> >> So I mostly agree!
>>> >>
>>> >> This would mean we treat the first impossible-to-handle exception as
>>> >> that fatal case, which is similar to x86's double-fault, triggered when
>>> >> the HW can't stack the regs. All other cases are just arbitrary faults.
>>> >>
>>> >> However, to provide that consistently, we'll need to perform this check
>>> >> at every exception boundary, or some of those cases will result in a
>>> >> recursive fault first.
>>> >>
>>> >> So I think there are three choices:
>>> >>
>>> >> 1) In el1_sync, only check SP bounds, and live with the recursive
>>> >> faults.
>>> >>
>>> >> 2) in el1_sync, check there's room for the regs, and live with the
>>> >> recursive faults for overflow on other exceptions.
>>> >>
>>> >> 3) In all EL1 entry paths, check there's room for the regs.
>>> >
>>> > FWIW, for the moment I've applied (2), as you suggested, to my
>>> > arm64/vmap-stack branch, adding an additional:
>>> >
>>> > sub x0, x0, #S_FRAME_SIZE
>>> >
>>> > ... to the entry path.
>>> >
>>> > I think it's worth trying (3) so that we consistently report these
>>> > cases, benchmarks permitting.
>>> >
>>> OK, so here's a crazy idea: what if we
>>> a) carve out a dedicated range in the VMALLOC area for stacks
>>> b) for each stack, allocate a naturally aligned window of 2x the stack
>>> size, and map the stack inside it, leaving the remaining space
>>> unmapped
>> This is not such a crazy idea. :)
>> In fact, it was one I toyed with before getting lost on a register
>> juggling tangent (see below).
>>> That way, we can compare SP (minus S_FRAME_SIZE) against a mask that
>>> is a build time constant, to decide whether its value points into a
>>> stack or not. Of course, it may be pointing into the wrong stack, but
>>> that should not prevent us from taking the exception, and we can deal
>>> with that later. It would give us a very cheap way to perform this
>>> test on the hot paths.
>> The logical ops (TST) and conditional branches (TB(N)Z, CB(N)Z) operate
>> on XZR rather than SP, so to do this we need to get the SP value into a
>> GPR.
>> Previously, I assumed this meant we needed to corrupt a GPR (and hence
>> stash that GPR in a sysreg), so I started writing code to free sysregs.
>> However, I now realise I was being thick, since we can stash the GPR
>> in the SP:
>> sub sp, sp, x0 // sp = orig_sp - x0
>> add x0, sp, x0 // x0 = (orig_sp - x0) + x0 == orig_sp
>> sub x0, x0, #S_FRAME_SIZE
>> tb(nz) x0, #THREAD_SHIFT, overflow
>> add x0, x0, #S_FRAME_SIZE
>> sub x0, sp, x0

You need a 'neg x0, x0' here, I think.

>> add sp, sp, x0
>> ... so yes, this could work!
> Nice!
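
To sanity-check the register juggling quoted above, here is a rough
user-space C model of the sequence. It is only a sketch: struct regs,
entry_check() and the S_FRAME_SIZE/THREAD_SHIFT values are placeholders
made up for illustration, and it assumes the stack sits in the half of
its window where the THREAD_SHIFT bit is set (so the branch is a tbz).

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values for illustration only; not taken from the kernel. */
#define S_FRAME_SIZE	0x130ULL
#define THREAD_SHIFT	14

/* Hypothetical model of the two registers the sequence touches. */
struct regs {
	uint64_t sp;
	uint64_t x0;
};

/*
 * Step through the quoted sequence one instruction at a time.
 * Returns 1 if the overflow branch would be taken, 0 otherwise.
 * Assumption: the THREAD_SHIFT bit is set for addresses inside a stack,
 * so a clear bit after subtracting S_FRAME_SIZE means "no room".
 */
static inline int entry_check(struct regs *r, int with_neg)
{
	r->sp = r->sp - r->x0;			/* sub sp, sp, x0 */
	r->x0 = r->sp + r->x0;			/* add x0, sp, x0 (== orig_sp) */
	r->x0 = r->x0 - S_FRAME_SIZE;		/* sub x0, x0, #S_FRAME_SIZE */
	if (!(r->x0 & (1ULL << THREAD_SHIFT)))	/* tbz x0, #THREAD_SHIFT, overflow */
		return 1;
	r->x0 = r->x0 + S_FRAME_SIZE;		/* add x0, x0, #S_FRAME_SIZE */
	r->x0 = r->sp - r->x0;			/* sub x0, sp, x0 (== -orig_x0) */
	if (with_neg)
		r->x0 = -r->x0;			/* neg x0, x0 (== orig_x0) */
	r->sp = r->sp + r->x0;			/* add sp, sp, x0 */
	return 0;
}
```

With the neg, both sp and x0 come back to their original values. Without
it, x0 ends up as -orig_x0 and sp as orig_sp - 2 * orig_x0.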

... only, this requires a dedicated stack region, and so we'd need to
check whether sp is inside that window as well.

The easiest way would be to use a window whose start address is base-2
aligned, but that means the beginning of the kernel VA range (where
KASAN currently lives, and cannot be moved afaik), or a window at the
top of the linear region. Neither looks very appealing.

So that means arbitrary low and high limits to compare against in this
entry path. That means more GPRs I'm afraid.
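
In C terms, the combined test would look roughly like the sketch below.
The region limits, S_FRAME_SIZE and THREAD_SHIFT values are hypothetical
placeholders, as are the helper names, and it again assumes the stack
occupies the half of each window where the THREAD_SHIFT bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration only; not taken from the kernel. */
#define THREAD_SHIFT		14
#define THREAD_SIZE		(1ULL << THREAD_SHIFT)
#define S_FRAME_SIZE		0x130ULL

/* Hypothetical limits of a dedicated stack region in the vmalloc area. */
#define STACK_REGION_START	0xffff000010000000ULL
#define STACK_REGION_END	0xffff000020000000ULL

/* Cheap part: one bit test against a build-time constant. */
static inline bool frame_in_stack_half(uint64_t sp)
{
	return ((sp - S_FRAME_SIZE) & THREAD_SIZE) != 0;
}

/* Extra part: two comparisons against arbitrary low and high limits. */
static inline bool frame_in_stack_region(uint64_t sp)
{
	return sp - S_FRAME_SIZE >= STACK_REGION_START &&
	       sp <= STACK_REGION_END;
}

/* Both must hold before the regs can safely be stacked. */
static inline bool can_stack_regs(uint64_t sp)
{
	return frame_in_stack_region(sp) && frame_in_stack_half(sp);
}
```

The bit test alone accepts any address with that bit set, including ones
far outside the stack region, which is why the two limit comparisons
(and the registers to hold the limits) come into it.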

>> This means that we have to align the initial task, so the kernel Image
>> will grow by THREAD_SIZE. Likewise for IRQ stacks, unless we can rework
>> things such that we can dynamically allocate all of those.
> We can't currently do that for 64k pages, since the segment alignment
> is only 64k. But we should be able to patch that up, I think.
>>> >> I believe that determining whether the exception was caused by a stack
>>> >> overflow is not something we can do robustly or efficiently.
>>> Actually, if the stack pointer is within S_FRAME_SIZE of the base, and
>>> the faulting address points into the guard page, that is a pretty
>>> strong indicator that the stack overflowed. That shouldn't be too
>>> costly?
>> Sure, but that's still a heuristic. For example, that also catches an
>> unrelated vmalloc address gone wrong, while SP was close to the end of
>> the stack.
> Yes, but the likelihood that an unrelated stray vmalloc access is
> within 16 KB of a stack pointer that is close to its limit is
> extremely low, so we should be able to live with the risk of
> misidentifying it.
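
For reference, the heuristic under discussion could be sketched in C as
follows. Everything here is an assumption made for illustration: the
helper name, the GUARD_SIZE and S_FRAME_SIZE values, and the idea that
the guard region sits directly below stack_base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration only; not taken from the kernel. */
#define S_FRAME_SIZE	0x130ULL
#define GUARD_SIZE	4096ULL

/*
 * Heuristic: report a likely stack overflow when SP is within
 * S_FRAME_SIZE of the stack base and the faulting address (far) lies in
 * the guard region immediately below that base.
 */
static inline bool looks_like_stack_overflow(uint64_t sp, uint64_t far,
					     uint64_t stack_base)
{
	bool sp_near_base = sp >= stack_base &&
			    sp - stack_base < S_FRAME_SIZE;
	bool far_in_guard = far < stack_base &&
			    stack_base - far <= GUARD_SIZE;

	return sp_near_base && far_in_guard;
}
```

A stray access that happens to land in the guard region while SP is near
the base would also return true here, which is the misidentification
risk being weighed.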
>> The important thing is whether we can *safely enter the exception* (i.e.
>> stack the regs), or whether this'll push the SP (further) out-of-bounds.
>> I think we agree that we can reliably and efficiently check this.
> Yes.
>> The general case of nominal "stack overflows" (e.g. large preidx
>> decrements, proxied SP values, unrelated guard-page faults) is a
>> semantic minefield. I don't think we should add code to try to
>> distinguish these.
>> For that general case, if we can enter the exception then we can try to
>> handle the exception in the usual way, and either:
>> * The fault code determines the access was bad. We at least kill the
>> thread.
>> * We overflow the stack while trying to handle the exception, triggering
>> a new fault to triage.
>> To make it possible to distinguish and debug these, we need to fix the
>> backtracing code, but that's it.
>> Thanks,
>> Mark.