Re: [PATCH 1/2] kernel/fork: Add support for stack-end guard page

From: Mark Rutland
Date: Wed Jul 24 2019 - 07:21:08 EST


On Wed, Jul 24, 2019 at 11:11:49AM +0200, Dmitry Vyukov wrote:
> On Tue, Jul 23, 2019 at 6:41 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> >
> > On Fri, Jul 19, 2019 at 03:28:17PM +0200, Marco Elver wrote:
> > > Enabling STACK_GUARD_PAGE helps catching kernel stack overflows immediately
> > > rather than causing difficult-to-diagnose corruption. Note that, unlike
> > > virtually-mapped kernel stacks, this will effectively waste an entire page of
> > > memory; however, this feature may provide extra protection in cases that cannot
> > > use virtually-mapped kernel stacks, at the cost of a page.
> > >
> > > The motivation for this patch is that KASAN cannot use virtually-mapped kernel
> > > stacks to detect stack overflows. An alternative would be implementing support
> > > for vmapped stacks in KASAN, but would add significant extra complexity.
> >
> > Do we have an idea as to how much additional complexity?
>
> We would need to map/unmap shadow for vmalloc region on stack
> allocation/deallocation. We may need to track shadow pages that cover
> both stack and an unused memory, or 2 different stacks, which are
> mapped/unmapped at different times. This may have some concurrency
> concerns. Not sure what about page tables for other CPU, I've seen
> some code that updates pages tables for vmalloc region lazily on page
> faults. Not sure what about TLBs. Probably also some problems that I
> can't thought about now.

Ok. So this looks big, we this hasn't been prototyped, so we don't have
a concrete idea. I agree that concurrency is likely to be painful. :)

[...]

> > > diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> > > index 288b065955b7..b218b5713c02 100644
> > > --- a/arch/x86/include/asm/page_64_types.h
> > > +++ b/arch/x86/include/asm/page_64_types.h
> > > @@ -12,8 +12,14 @@
> > > #define KASAN_STACK_ORDER 0
> > > #endif
> > >
> > > +#ifdef CONFIG_STACK_GUARD_PAGE
> > > +#define STACK_GUARD_SIZE PAGE_SIZE
> > > +#else
> > > +#define STACK_GUARD_SIZE 0
> > > +#endif
> > > +
> > > #define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
> > > -#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
> > > +#define THREAD_SIZE ((PAGE_SIZE << THREAD_SIZE_ORDER) - STACK_GUARD_SIZE)
> >
> > I'm pretty sure that common code relies on THREAD_SIZE being a
> > power-of-two. I also know that if we wanted to enable this on arm64 that
> > would very likely be a requirement.
> >
> > For example, in kernel/trace/trace_stack.c we have:
> >
> > | this_size = ((unsigned long)stack) & (THREAD_SIZE-1);
> >
> > ... and INIT_TASK_DATA() allocates the initial task stack using
> > THREAD_SIZE, so that may require special care, as it might not be sized
> > or aligned as you expect.
>
> We've built it, booted it, stressed it, everything looked fine... that
> should have been a build failure.

I think it's been an implicit assumption for so long that no-one saw the need
for built-time assertions where they depend on it.

I also suspect that in practice there are paths that you won't have
stressed in your environment, e.g. in the ACPI wakeup path where we end
up calling:

/* Unpoison the stack for the current task beyond a watermark sp value. */
asmlinkage void kasan_unpoison_task_stack_below(const void *watermark)
{
/*
* Calculate the task stack base address. Avoid using 'current'
* because this function is called by early resume code which hasn't
* yet set up the percpu register (%gs).
*/
void *base = (void *)((unsigned long)watermark & ~(THREAD_SIZE - 1));

kasan_unpoison_shadow(base, watermark - base);
}

> Is it a property that we need to preserve? Or we could fix the uses
> that assume power-of-2?

Generally, I think that those can be fixed up. Someone just needs to dig
through how THREAD_SIZE and THREAD_SIZE_ORDER are used to generate or
manipulate addresses.

For local-task stuff, I think it's easy to rewrite in terms of
task_stack_page(), but I'm not entirely sure what we'd do for cases
where we look at another task, e.g.

static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
unsigned long prev_depth = THREAD_SIZE -
(task->prev_lowest_stack & (THREAD_SIZE - 1));
unsigned long depth = THREAD_SIZE -
(task->lowest_stack & (THREAD_SIZE - 1));

seq_printf(m, "previous stack depth: %lu\nstack depth: %lu\n",
prev_depth, depth);
return 0;
}

... as I'm not sure of the lifetime of task->stack relative to task. I
know that with THREAD_INFO_IN_TASK the stack can be freed while the task
is still live.

Thanks,
Mark.