Re: [RFC PATCH 6/6] arm64: add VMAP_STACK and detect out-of-bounds SP
From: Mark Rutland
Date: Thu Jul 13 2017 - 06:51:05 EST
On Thu, Jul 13, 2017 at 07:58:50AM +0100, Ard Biesheuvel wrote:
> Hi Mark,
Hi,
> On 12 July 2017 at 23:33, Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > +#ifdef CONFIG_VMAP_STACK
> > +.macro detect_bad_stack
> > + msr sp_el0, x0
> > + get_thread_info x0
> > + ldr x0, [x0, #TSK_TI_CUR_STK]
> > + sub x0, sp, x0
> > + and x0, x0, #~(THREAD_SIZE - 1)
> > + cbnz x0, __bad_stack
> > + mrs x0, sp_el0
>
> The typical prologue looks like
>
> stp x29, x30, [sp, #-xxx]!
> stp x27, x28, [sp, #xxx]
> ...
> mov x29, sp
>
> which means that in most cases where we do run off the stack, sp will
> still be pointing into it when the exception is taken. This means we
> will fault recursively in the handler before having had the chance to
> accurately record the exception context.
True; I had mostly been thinking about kernel_entry, where we do an
explicit subtraction from the SP before any stores.
> Given that the max displacement of a store instruction is 512 bytes,
> and that the frame size we are about to stash exceeds that, should we
> already consider it a stack fault if sp is within 512 bytes (or
> S_FRAME_SIZE) of the base of the stack?
Good point.
I've flip-flopped on this point while writing this reply.
My original line of thinking was that it was best to rely on the
recursive fault to push the SP out-of-bounds. That keeps the overflow
detection simple/fast, and hopefully robust to unexpected exceptions,
(expected?) probes to the guard page, etc.
I also agree that it's annoying to lose the information associated with the
initial fault.
My fear is that we can't catch those cases robustly and efficiently. At
minimum, I believe we'd need to check:
* FAR_EL1 is out-of-bounds for the stack. You have a suitable check for
this.
* FAR_EL1 is valid (looking at the ESR_ELx.{EC,ISS}, etc). I'm not sure
exactly what we need to check here, and I'm not sure what we want to
do about reserved ESR_ELx encodings.
* The base register for the access was the SP (e.g. so this isn't a
probe_kernel_read() or similar).
... so my current feeling is that relying on the recursive fault is the
best bet, even if we lose some information from the initial fault.
Along with that, we should ensure that we get a reliable backtrace, so
that we have the PC from the initial fault, and can acquire the relevant
regs from a dump of the stack and/or the regs at the point of the
recursive fault.
FWIW, currently this series gives you something like:
[ 0.263544] Stack out-of-bounds!
[ 0.263544] sp: 0xffff000009fbfed0
[ 0.263544] tsk stack: [0xffff000009fc0000..0xffff000009fd0000]
[ 0.263544] irq stack: [0xffff80097fe100a0..0xffff80097fe200a0]
[ 0.304862] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.12.0-00006-g0c4fb26-dirty #73
[ 0.312830] Hardware name: ARM Juno development board (r1) (DT)
[ 0.318847] task: ffff800940d8a200 task.stack: ffff000009fc0000
[ 0.324872] PC is at el1_sync+0x20/0xc8
[ 0.328773] LR is at force_overflow+0xc/0x18
[ 0.333113] pc : [<ffff000008082460>] lr : [<ffff00000808c75c>] pstate: 600003c5
[ 0.340636] sp : ffff000009fbfed0
[ 0.344004] x29: ffff000009fc0000 x28: 0000000000000000
[ 0.349409] x27: 0000000000000000 x26: 0000000000000000
[ 0.354812] x25: 0000000000000000 x24: 0000000000000000
[ 0.360214] x23: 0000000000000000 x22: 0000000000000000
[ 0.365617] x21: 0000000000000000 x20: 0000000000000001
[ 0.371020] x19: 0000000000000001 x18: 0000000000000030
[ 0.376422] x17: 0000000000000000 x16: 0000000000000000
[ 0.381826] x15: 0000000000000008 x14: 000000000fb506bc
[ 0.387228] x13: 0000000000000000 x12: 0000000000000000
[ 0.392631] x11: 0000000000000000 x10: 0000000000000141
[ 0.398034] x9 : 0000000000000000 x8 : ffff80097fdf93e8
[ 0.403437] x7 : ffff80097fdf9410 x6 : 0000000000000001
[ 0.408839] x5 : ffff000008ebcb80 x4 : ffff000008eb65d8
[ 0.414242] x3 : 00000000000f4240 x2 : 0000000000000002
[ 0.419644] x1 : ffff800940d8a200 x0 : 0000000000000001
[ 0.425048] Kernel panic - not syncing: stack out-of-bounds
[ 0.430714] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.12.0-00006-g0c4fb26-dirty #73
[ 0.438679] Hardware name: ARM Juno development board (r1) (DT)
[ 0.444697] Call trace:
[ 0.447185] [<ffff000008086f68>] dump_backtrace+0x0/0x230
[ 0.452676] [<ffff00000808725c>] show_stack+0x14/0x20
[ 0.457815] [<ffff00000838760c>] dump_stack+0x9c/0xc0
[ 0.462953] [<ffff00000816d5a0>] panic+0x11c/0x294
[ 0.467825] [<ffff000008087a70>] __pte_error+0x0/0x28
[ 0.472961] [<ffff00000808c75c>] force_overflow+0xc/0x18
[ 0.478364] SMP: stopping secondary CPUs
[ 0.482356] ---[ end Kernel panic - not syncing: stack out-of-bounds
... that __pte_error() is because the last instruction in handle_bad_stack is a
tail-call to panic, and __pte_error happens to be next in the text.
I haven't yet dug into why the stacktrace ends abruptly. I think I need
to update stack walkers to understand the new stack, but I may also have
forgotten to do something with the frame record in the entry path.
[...]
> > +#ifdef CONFIG_VMAP_STACK
> > +DEFINE_PER_CPU(unsigned long [IRQ_STACK_SIZE/sizeof(long)], bad_stack) __aligned(16);
> > +
>
> Surely, we don't need a 16 KB or 64 KB stack here?
For most cases, we do not need such a big stack. We can probably drop
this down to something much smaller (1K, as with your series, sounds
sufficient).
The one case I was worried about was overflows on the emergency stack
itself. I believe that for dumping memory we might need to fix up
exceptions, and if that goes wrong we could go recursive.
I'd planned to update current_stack when jumping to the emergency stack,
and use the same (initial) bounds detection, requiring the emergency
stack to be the same size. In the case of an emergency stack overflow,
we'd go to a (stackless) wfi/wfe loop.
However, I deleted bits of that code while trying to debug an unrelated
issue, and didn't restore it.
I guess it depends on whether we want to try to handle that case.
Thanks,
Mark.